weisnobody, today at 7:38 PM

I think the crawled data should have to be shared, but I'm not convinced that Google should have to share their index.

It may be impracticable to share the crawled data, but from the standpoint of content providers, having a single entity collect the information (rather than a bunch of people each doing it) would seem to be better for everyone. It would likely need some form of robots.txt that lets the content provider indicate how their content may be used (e.g., research, web search, AI).
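To make the robots.txt idea a bit more concrete, here is a rough Python sketch of how a crawler might read per-purpose permissions. The `Allow-Use` and `Disallow-Use` directives and the purpose names are invented for illustration; nothing like them exists in the real robots.txt standard as far as I know.

```python
# Hypothetical extension: "Allow-Use" / "Disallow-Use" are NOT real robots.txt directives.
SAMPLE = """\
User-agent: *
Allow-Use: research, web-search
Disallow-Use: ai-training
"""

def parse_usage_policy(text):
    """Return (allowed, disallowed) sets of purposes from the hypothetical directives."""
    allowed, disallowed = set(), set()
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        purposes = {p.strip().lower() for p in value.split(",") if p.strip()}
        key = key.strip().lower()
        if key == "allow-use":
            allowed |= purposes
        elif key == "disallow-use":
            disallowed |= purposes
    return allowed, disallowed

def may_use(text, purpose):
    """A purpose is permitted only if explicitly allowed and not disallowed."""
    allowed, disallowed = parse_usage_policy(text)
    purpose = purpose.lower()
    return purpose in allowed and purpose not in disallowed
```

Defaulting to "not permitted unless listed" is just one possible choice; an opt-out default would work the other way around.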

The people accessing the crawled data would end up paying (reasonable) fees to access the level of data they want, and some portion of that fee would go to the content provider (30% to the crawler and 70% to the content provider? :P maybe).

Maybe even go so far as to allow paywalled content providers to set a price on accessing their data for the different purposes. Should they also be allowed to pick and choose who within those purpose types gets access (or should that be based on violations of the terms of access)?
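As a back-of-the-envelope sketch of per-purpose pricing with a revenue split, something like the following would do. The prices are made up, and the 30% crawler / 70% content provider split is only my guess at a starting point.

```python
from decimal import Decimal

# Hypothetical per-document prices a content provider might set (illustrative only).
PRICES = {
    "research": Decimal("0.00"),
    "web-search": Decimal("0.01"),
    "ai-training": Decimal("0.10"),
}
CRAWLER_SHARE = Decimal("0.30")  # assumption: crawler keeps 30%, provider gets the rest

def split_fee(purpose, documents):
    """Return (crawler_amount, provider_amount) for accessing `documents` items."""
    total = PRICES[purpose] * documents
    crawler = (total * CRAWLER_SHARE).quantize(Decimal("0.01"))  # round to cents
    provider = total - crawler  # remainder goes to the content provider
    return crawler, provider
```

For example, `split_fee("ai-training", 1000)` on a total of 100.00 gives 30.00 to the crawler and 70.00 to the content provider.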

It seems in part the content providers have the following complaints:

  * Too many crawlers (see note below re crawlers)
  * Crawlers not being friendly
  * Improper use of the crawled data
  * Not getting compensated for their content

Why not the index? The index, to me, is where a bunch of the "magic" happens and where individual companies could differentiate themselves from everyone else.

Why can't Microsoft retain Bing traffic when it's the default on stock Windows installs?

  * Do they not have enough crawled data?
  * Is their index not very good?
  * Is their search over that index not good?
  * The way they present the data is bad?
  * Google is too entrenched?
  * Combination of the above?

There are several entities intending to crawl all or large portions of the Internet: Baidu, Bing, Brave, Google, DuckDuckGo, Gigablast, Mojeek, Sogou and Yandex [1]. That does not include any of the smaller entities, research projects, etc.

[1] https://en.wikipedia.org/wiki/Search_engine#2000s–present:_P... (2019)