Hacker News

jsheard last Monday at 3:05 PM

In this case it actually is OpenAI: the IP (74.7.175.182) is in one of their published ranges (74.7.175.128/25).

https://openai.com/searchbot.json

I don't know if imitating a major crawler is really worth it. It may work against very naive filters, but it's easy to check definitively whether a request is faking it, so you're just handing ammo to the more advanced filters that do check.

  $ curl -I https://www.cloudflare.com
  HTTP/2 200

  $ curl -I -H "User-Agent: Googlebot" https://www.cloudflare.com
  HTTP/2 403
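For reference, the range check itself is only a few lines. This is a minimal Python sketch using the standard `ipaddress` module, with the single range quoted above hard-coded as an assumption. A real filter would presumably load the full, current list from the published JSON, and the function name here is made up for illustration.

```python
import ipaddress

# Assumption: only the one range quoted in this thread. A real check would
# load the complete, current list from the crawler operator's published file.
PUBLISHED_RANGES = [ipaddress.ip_network("74.7.175.128/25")]


def in_published_ranges(ip, ranges=PUBLISHED_RANGES):
    """Return True if ip falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ranges)


# A /25 starting at .128 covers .128 through .255.
print(in_published_ranges("74.7.175.182"))  # the IP from this thread
print(in_published_ranges("74.7.175.10"))   # same /24, but outside the /25
```

The first call should report the address as inside the range and the second as outside, which matches the lookup above.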

Replies

btown last Monday at 5:36 PM

I don't have a statistic here, but I'm always surprised how many websites I come across that do limited user-agent and origin/referrer checks but don't maintain any kind of active IP-based tracking. If you're trying to build a site-specific scraper and are getting blocked, mimicking headers is an easy and often sufficient step.
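To make the difference concrete, here is a rough Python sketch of the two kinds of filter. Everything in it is hypothetical: the `ExampleBot` token, the function names, and the address range (a documentation-only block) stand in for whatever a real site and crawler would use.

```python
import ipaddress

# Hypothetical crawler identity: a made-up User-Agent token and a
# documentation-only address block standing in for a published range list.
CRAWLER_TOKEN = "examplebot"
CRAWLER_RANGES = [ipaddress.ip_network("198.51.100.0/24")]


def header_only_filter(user_agent):
    """Trust any request whose User-Agent names the crawler."""
    return CRAWLER_TOKEN in user_agent.lower()


def header_and_ip_filter(user_agent, remote_ip):
    """Trust a claimed crawler only if its IP is in a published range."""
    if not header_only_filter(user_agent):
        return False
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in net for net in CRAWLER_RANGES)


# A scraper that copies the header from an unrelated address:
spoofed_ua, spoofed_ip = "Mozilla/5.0 (compatible; ExampleBot/1.0)", "203.0.113.7"
print(header_only_filter(spoofed_ua))                # accepted on the header alone
print(header_and_ip_filter(spoofed_ua, spoofed_ip))  # rejected once the IP is checked
```

The header-only version is the kind of check that mimicked headers get past. The second version is what the IP lookup upthread amounts to.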

Aurornis last Monday at 3:06 PM

Thanks for looking it up!