
Aurornis yesterday at 3:02 PM

This could be OpenAI, or it could be another company using their header pattern.

It has long been common for scrapers to adopt the header patterns of search engine crawlers to hide in logs and bypass simple filters. The logical next step is for smaller AI players to present themselves as the largest players in the space.

Some search engines provide a list of their scraper IP ranges specifically so you can verify if scraper activity is really them or an imitator.

EDIT: Thanks to the comment below for looking this up and confirming this IP matches OpenAI’s range.


Replies

jsheard yesterday at 3:05 PM

In this case it is actually OpenAI, the IP (74.7.175.182) is in one of their published ranges (74.7.175.128/25).

https://openai.com/searchbot.json
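For anyone who wants to run the same check, here is a minimal sketch using Python's standard `ipaddress` module. The range and address are the ones quoted above; the function name `ip_in_published_ranges` and the hardcoded list are only illustrative, and a real filter would presumably load the current ranges from the operator's published list rather than pinning one.

```python
import ipaddress

# Range quoted in the comment above. Hardcoded here for illustration only;
# published ranges can change, so a real check should refresh them.
PUBLISHED_RANGES = ["74.7.175.128/25"]


def ip_in_published_ranges(ip, ranges=PUBLISHED_RANGES):
    """Return True if `ip` falls inside any of the given CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in ranges)


if __name__ == "__main__":
    # The address from the logs sits inside the /25 (128 through 255).
    print(ip_in_published_ranges("74.7.175.182"))  # True
    # An address outside the /25 is rejected.
    print(ip_in_published_ranges("74.7.175.1"))    # False
```

A client that claims the crawler's user agent but fails this check is an imitator, at least as far as the published list is concerned.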

I don't know if imitating a major crawler is really worth it. It may work against very naive filters, but it's easy to definitively check whether a client is faking, so it just hands ammo to more advanced filters that do check.

  $ curl -I https://www.cloudflare.com
  HTTP/2 200

  $ curl -I -H "User-Agent: Googlebot" https://www.cloudflare.com
  HTTP/2 403
ccgreg yesterday at 9:00 PM

> Some search engines provide a list of their scraper IP ranges

Common Crawl's CCBot has published IP ranges. We aren't a search engine (although there are search engines using our data) and we like to describe our crawler as a crawler, not a "scraper".

bobsmooth today at 2:33 AM

> The logical next step is for smaller AI players to present themselves as the largest players in the space.

We think we're so different from animals https://en.wikipedia.org/wiki/Mimicry