Hacker News

throwaway613745 yesterday at 3:35 PM

OpenAI is scraping everything that is publicly accessible. Everything.


Replies

Aachen yesterday at 3:41 PM

Yet they publish the user agents and IP address ranges they scrape from, and say they respect robots.txt.

I run a web server and so see a lot of scrapers, and OpenAI is one of the ones that appear to respect the limits you set. A lot of (if not most) others don't even have that ethical standard. So I wouldn't say "OpenAI scrapes everything they can access. Everything" without qualification, as that doesn't seem to be true, at least not until someone puts a file behind a robots.txt Disallow rule and finds that ChatGPT (or another of OpenAI's products) has knowledge of it.
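For anyone who wants to set up the test described above, here is a minimal sketch of how a robots.txt rule is evaluated for a given crawler, using Python's standard `urllib.robotparser`. The `GPTBot` user agent token, the `example.com` host, and the `/private/` path are assumptions for illustration, not something taken from this thread; check the crawler's own documentation for the token it actually sends. This only shows what a compliant crawler should conclude from the file, not what any crawler really does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: deny one named crawler a directory, allow everyone else.
# "GPTBot" is assumed here as the crawler's user agent token.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /private/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "SomeOtherBot"):
        for path in ("/private/canary.html", "/index.html"):
            url = "https://example.com" + path
            print(agent, path, is_allowed(ROBOTS_TXT, agent, url))
```

Putting a unique "canary" page under the denied path and later checking whether a product knows its contents would be one way to run the experiment, though a hit would not by itself show which crawler fetched it.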

warkdarrior yesterday at 5:00 PM

So do Google, Microsoft/Bing, Yandex, etc. How else would they make sure their search/chatbot/q&a products are up to date?