A huge amount of the web is only crawlable with a Googlebot user-agent and requests coming from Google's published crawler IP ranges.
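For context, this is why spoofing the user-agent alone doesn't get you in: sites that gate on Googlebot typically verify the source IP with a reverse-then-forward DNS check, which is the method Google itself documents. A minimal sketch of that check (illustrative only; real deployments cache results and often also consult Google's published IP list):

```python
# Sketch: verify that a request claiming to be Googlebot really
# comes from Google, via reverse DNS plus forward confirmation.
import socket

def is_real_googlebot(ip: str) -> bool:
    try:
        # Reverse lookup: the PTR record should end in googlebot.com or google.com.
        host, _, _ = socket.gethostbyaddr(ip)
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward-confirm: the hostname must resolve back to the same IP,
        # otherwise the PTR record could be spoofed.
        return ip in {a[4][0] for a in socket.getaddrinfo(host, None)}
    except (socket.herror, socket.gaierror):
        return False

print(is_real_googlebot("66.249.66.1"))  # an address in Google's crawler range
```

A crawler faking the Googlebot user-agent from its own IPs fails the reverse lookup, so the site can safely serve it a block page.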
I don't know a lot about this subject, but couldn't you make a pretty decent index off of Common Crawl? It seems to me the bar is low enough that you wouldn't need everything, especially if your goal wasn't monetization with ads.
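You can, at least mechanically: Common Crawl publishes both the raw WARC archives and a queryable CDX index, so you can pull page bodies without running a crawler at all. A rough sketch, assuming the `requests` and `warcio` packages; the crawl ID below is just an example (current IDs are listed on commoncrawl.org):

```python
# Sketch: look up one page in Common Crawl's CDX index, then fetch
# just that record from the public WARC file with an HTTP range request.
# Requires: pip install requests warcio
import io
import json
import requests
from warcio.archiveiterator import ArchiveIterator

CRAWL = "CC-MAIN-2024-10"  # example crawl ID; pick a current one

def fetch_from_common_crawl(url: str) -> bytes:
    # Ask the CDX index where the capture lives inside the crawl's WARC files.
    resp = requests.get(
        f"https://index.commoncrawl.org/{CRAWL}-index",
        params={"url": url, "output": "json", "limit": "1"},
        timeout=30,
    )
    resp.raise_for_status()
    # The index returns newline-delimited JSON; this assumes at least one capture.
    rec = json.loads(resp.text.splitlines()[0])
    offset, length = int(rec["offset"]), int(rec["length"])
    # Range-request only this record's bytes from the hosted WARC file.
    warc = requests.get(
        "https://data.commoncrawl.org/" + rec["filename"],
        headers={"Range": f"bytes={offset}-{offset + length - 1}"},
        timeout=30,
    )
    warc.raise_for_status()
    for record in ArchiveIterator(io.BytesIO(warc.content)):
        if record.rec_type == "response":
            return record.content_stream().read()  # raw HTML, ready to index

print(len(fetch_from_common_crawl("example.com")))
```

The indexing part on top of this is conventional; the catch is freshness and coverage, since each crawl is a periodic sample of the web rather than the whole thing.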
If a crawler offered enough money, it could be allowed in too. It's not like Google has exclusive crawling rights.
> And given you-know-what, the battle to establish a new search crawler will be harder than ever. Crawlers are now presumed guilty of scraping for AI services until proven innocent.
I've always wondered how the Wayback Machine works. Is there no way we could take the Wayback archives and run an index on top of every capture somehow?
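Roughly: the Wayback Machine stores captures in WARC files and exposes a CDX index over them, so you can enumerate snapshots for a URL and fetch the raw page bodies to index. A small sketch of the publicly documented CDX API (the `id_` suffix asks for the original bytes without the archive's injected toolbar; `example.com` is a placeholder):

```python
# Sketch: enumerate Wayback Machine captures for a URL via the CDX API,
# then fetch the raw archived bytes for each snapshot.
import requests

def wayback_snapshots(url: str, limit: int = 5):
    resp = requests.get(
        "https://web.archive.org/cdx/search/cdx",
        params={
            "url": url,
            "output": "json",
            "limit": str(limit),
            "filter": "statuscode:200",  # skip redirects and errors
        },
        timeout=30,
    )
    resp.raise_for_status()
    rows = resp.json()
    header, captures = rows[0], rows[1:]  # first row is the field names
    for cap in captures:
        rec = dict(zip(header, cap))
        # "id_" returns the original response body, not the replay page.
        raw = requests.get(
            f"https://web.archive.org/web/{rec['timestamp']}id_/{rec['original']}",
            timeout=30,
        )
        yield rec["timestamp"], raw.text  # timestamped HTML, ready to index

for ts, html in wayback_snapshots("example.com"):
    print(ts, len(html))
```

So the mechanics are straightforward; the blockers for a search engine built on it are presumably scale (the archive is petabytes), rate limits, and the Internet Archive's terms, not the API.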