
dspillett · yesterday at 9:39 PM

> It's possibly the only way to keep the AI crawlers away.

Unfortunately that won't work. If you've served them enough content to have a noticeable poisoning effect, then you've already allowed all that load through your resources. It won't stop them coming either: for the most part the crawlers don't talk to each other, so even if you drive some away, more will come. There is no collaborative list of good and bad places to scrape.

The only halfway useful answer to the load issue at the moment is PoW tricks like Anubis, and those can inconvenience some of your target audience as well. They don't protect your content at all: once it is copied elsewhere for any reason, it'll get scraped from there. For instance, if you keep some OSS code off GitHub, behind some sort of bot protection, to stop it ending up in Copilot's training data, someone may eventually fork it and push their version to GitHub anyway, nullifying your attempt.
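(For anyone unfamiliar: PoW gates like Anubis make each visitor's browser solve a small hash puzzle before the page is served, so the cost is trivial per human visit but adds up across a crawler's request volume. A minimal sketch of the general idea in Python, assuming a simple SHA-256 leading-zero-bits scheme; the names and difficulty value are illustrative, not Anubis's actual implementation:

    import hashlib
    import itertools
    import os

    DIFFICULTY = 16  # required leading zero bits; illustrative tuning knob

    def make_challenge() -> str:
        # Server issues a random token per visitor/session.
        return os.urandom(16).hex()

    def solve(challenge: str) -> int:
        # Client brute-forces a nonce: cheap once per human page view,
        # expensive when multiplied across millions of crawler requests.
        for nonce in itertools.count():
            digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
            if int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0:
                return nonce

    def verify(challenge: str, nonce: int) -> bool:
        # Server-side verification costs just one hash.
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        return int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0

    c = make_challenge()
    assert verify(c, solve(c))

The asymmetry is the whole point: verification is one hash for the server, while solving takes ~2^DIFFICULTY hashes for the client.)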


Replies

jordanb · yesterday at 9:44 PM

My point is that if crawlers have to worry about poison, that may make them start to respect robots.txt or something. It's a bit like a "Beware of Dog" sign.
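(For reference, the opt-out they'd be respecting is just a plain-text file at the site root. A sketch listing a few self-identified AI crawler user agents; the list is illustrative and incomplete, and nothing technically forces compliance:

    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

Which is exactly the problem: it only works if the crawler chooses to honor it.)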
