A huge amount of the web is only crawlable with a Googlebot user-agent and requests coming from Google's published crawler IP ranges.
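For context, this is why spoofing the user-agent alone doesn't get you in: sites that gate on Googlebot typically verify the source IP with a reverse-then-forward DNS check, which is the method Google itself documents. A minimal sketch of that check (illustrative only; real deployments cache results and often also consult Google's published IP list):

```python
# Sketch: verify that a request claiming to be Googlebot really
# comes from Google, via reverse DNS plus forward confirmation.
import socket

def is_real_googlebot(ip: str) -> bool:
    try:
        # Reverse lookup: the PTR record should end in googlebot.com or google.com.
        host, _, _ = socket.gethostbyaddr(ip)
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward-confirm: the hostname must resolve back to the same IP,
        # otherwise the PTR record could be spoofed.
        return ip in {a[4][0] for a in socket.getaddrinfo(host, None)}
    except (socket.herror, socket.gaierror):
        return False

print(is_real_googlebot("66.249.66.1"))  # an address in Google's crawler range
```

A crawler faking the Googlebot user-agent from its own IPs fails the reverse lookup, so the site can safely serve it a block page.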
I don't know a lot about this subject, but couldn't you make a pretty decent index off of Common Crawl? It seems to me the bar is low enough that you wouldn't need everything, especially if your goal wasn't monetization with ads.
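You can, at least mechanically: Common Crawl publishes both the raw WARC archives and a queryable CDX index, so you can pull page bodies without running a crawler at all. A rough sketch, assuming the `requests` and `warcio` packages; the crawl ID below is just an example (current IDs are listed on commoncrawl.org):

```python
# Sketch: look up one page in Common Crawl's CDX index, then fetch
# just that record from the public WARC file with an HTTP range request.
# Requires: pip install requests warcio
import io
import json
import requests
from warcio.archiveiterator import ArchiveIterator

CRAWL = "CC-MAIN-2024-10"  # example crawl ID; pick a current one

def fetch_from_common_crawl(url: str) -> bytes:
    # Ask the CDX index where the capture lives inside the crawl's WARC files.
    resp = requests.get(
        f"https://index.commoncrawl.org/{CRAWL}-index",
        params={"url": url, "output": "json", "limit": "1"},
        timeout=30,
    )
    resp.raise_for_status()
    # The index returns newline-delimited JSON; this assumes at least one capture.
    rec = json.loads(resp.text.splitlines()[0])
    offset, length = int(rec["offset"]), int(rec["length"])
    # Range-request only this record's bytes from the hosted WARC file.
    warc = requests.get(
        "https://data.commoncrawl.org/" + rec["filename"],
        headers={"Range": f"bytes={offset}-{offset + length - 1}"},
        timeout=30,
    )
    warc.raise_for_status()
    for record in ArchiveIterator(io.BytesIO(warc.content)):
        if record.rec_type == "response":
            return record.content_stream().read()  # raw HTML, ready to index

print(len(fetch_from_common_crawl("example.com")))
```

The indexing part on top of this is conventional; the catch is freshness and coverage, since each crawl is a periodic sample of the web rather than the whole thing.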
If a crawler offered enough money, it could be allowed in too. It's not like Google has exclusive crawling rights.
> And given you-know-what, the battle to establish a new search crawler will be harder than ever. Crawlers are now presumed guilty of scraping for AI services until proven innocent.
I've always wondered how the Wayback Machine works. Is there no way we could take the Wayback archives and run an index on top of every capture somehow?
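Roughly: the Wayback Machine stores captures in WARC files and exposes a CDX index over them, so you can enumerate snapshots for a URL and fetch the raw page bodies to index. A small sketch of the publicly documented CDX API (the `id_` suffix asks for the original bytes without the archive's injected toolbar; `example.com` is a placeholder):

```python
# Sketch: enumerate Wayback Machine captures for a URL via the CDX API,
# then fetch the raw archived bytes for each snapshot.
import requests

def wayback_snapshots(url: str, limit: int = 5):
    resp = requests.get(
        "https://web.archive.org/cdx/search/cdx",
        params={
            "url": url,
            "output": "json",
            "limit": str(limit),
            "filter": "statuscode:200",  # skip redirects and errors
        },
        timeout=30,
    )
    resp.raise_for_status()
    rows = resp.json()
    header, captures = rows[0], rows[1:]  # first row is the field names
    for cap in captures:
        rec = dict(zip(header, cap))
        # "id_" returns the original response body, not the replay page.
        raw = requests.get(
            f"https://web.archive.org/web/{rec['timestamp']}id_/{rec['original']}",
            timeout=30,
        )
        yield rec["timestamp"], raw.text  # timestamped HTML, ready to index

for ts, html in wayback_snapshots("example.com"):
    print(ts, len(html))
```

So the mechanics are straightforward; the blockers for a search engine built on it are presumably scale (the archive is petabytes), rate limits, and the Internet Archive's terms, not the API.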