Hacker News

jordanb · yesterday at 9:22 PM · 3 replies

There may be plenty of content out there, but everyone with content on the internet is struggling to keep out AI crawlers they never authorized. In many cases, people have to do so just to protect their infrastructure from request spamming.

Since AI crawlers don't obey consent markers denying access to content, it makes sense for content owners who don't want AI trained on their content to poison it if possible. It's possibly the only way to keep the AI crawlers away.
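One hypothetical way to do this is to serve scrambled text to requests whose user agent matches a known AI crawler. The user-agent substrings below are real published crawler identifiers, but the scrambling scheme is an illustrative sketch, not a vetted defense:

```python
import random

# Substrings of user agents published by major AI crawlers.
SUSPECT_AGENTS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")

def is_suspect_crawler(user_agent: str) -> bool:
    return any(bot in user_agent for bot in SUSPECT_AGENTS)

def poison(text: str, seed: int = 0) -> str:
    # Shuffle the words within each sentence so the page remains
    # superficially plausible but is useless as training data.
    rng = random.Random(seed)
    scrambled = []
    for sentence in text.split(". "):
        words = sentence.split()
        rng.shuffle(words)
        scrambled.append(" ".join(words))
    return ". ".join(scrambled)

def render(page_text: str, user_agent: str) -> str:
    # Humans get the real page; suspected crawlers get the poisoned one.
    return poison(page_text) if is_suspect_crawler(user_agent) else page_text
```

The obvious weakness, as noted downthread, is that this still costs you the bandwidth of serving the poisoned copy, and crawlers that spoof a browser user agent slip through.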


Replies

Legend2440 · yesterday at 10:36 PM

I don't think this traffic is actually coming from crawlers for training.

Think about it: why would a training scraper need to hit the same page hundreds of times a day? It only needs to download it once.

I think this is LLMs doing web searches at runtime in response to user queries. There's no caching at this level, so similar queries from many different users can lead the LLM to request the same page many times.
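The missing deduplication described here could be as simple as a short-TTL cache in front of the fetcher: repeated requests for the same URL within the window collapse into one origin hit. This is a sketch under that assumption; the class name and TTL value are made up for illustration:

```python
class TTLCache:
    """Deduplicate runtime web fetches with a time-to-live cache."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, str]] = {}
        self.origin_fetches = 0  # how many times we actually hit the site

    def fetch(self, url: str, now: float, origin) -> str:
        entry = self._store.get(url)
        if entry and now - entry[0] < self.ttl:
            return entry[1]  # served from cache; origin untouched
        self.origin_fetches += 1
        body = origin(url)  # origin is the real fetch function
        self._store[url] = (now, body)
        return body
```

With something like this between the LLM and the web, a hundred users asking about the same page would cost the site one request per TTL window instead of a hundred.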

dspillett · yesterday at 9:39 PM

> It's possibly the only way to keep the AI crawlers away.

Unfortunately, that won't work. If you've served them enough content to have a noticeable poisoning effect, then you've already allowed all that load through your resources. It won't stop them coming, either: for the most part the crawlers don't talk to each other, so even if you drive some away, more will come; there is no collaborative list of good and bad places to scrape.

The only half-way useful answer to the load issue at the moment is PoW tricks like Anubis, and those can inconvenience some of your target audience as well. They don't protect your content at all: once it is copied elsewhere for any reason, it'll get scraped from there. For instance, if you keep some OSS code off GitHub, behind some sort of bot protection, to stop it ending up in Copilot's dataset, someone may eventually fork it and push their version to GitHub anyway, nullifying your attempt.

lxgr · yesterday at 10:04 PM

If you put something on the open web, as I see it, you only get so much say in what people do with it.

Yes, they can't publish it without attribution and/or compensation (that's copyright, at least currently, for better or worse). Yes, they shouldn't get to hammer your server with redundant, brainless requests for thousands of copies of the same content that no human will ever read (abuse/DDoS prevention).

No, I don't think you get to decide what user agent your visitors are using, and whether that user agent will summarize or otherwise transform it, using LLMs, ad blockers, or 273 artisanal regular expressions enabling dark/bright/readable/pink mode.

> it makes sense for content owners who don't want AI trained on their content to poison it if possible. It's possibly the only way to keep the AI crawlers away.

How would that work? The crawler needs to, well, crawl your site to determine that it's full of slop. At that point, it's already incurred the cost to you.

I'm all for banning spammy, high-request-rate crawlers, but those you would detect via abusive request patterns, and that won't be influenced by tokens.
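Detecting by request pattern rather than by content or user agent could be as simple as a sliding-window counter per client IP. A minimal sketch; the window size and threshold are made-up values for the example:

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 120  # ~2 req/s sustained before we call it abusive

# Per-client request timestamps, oldest first.
_history: dict[str, deque] = defaultdict(deque)

def is_abusive(client_ip: str, now: float) -> bool:
    window = _history[client_ip]
    window.append(now)
    # Drop timestamps that have fallen out of the sliding window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_REQUESTS
```

This catches the hammering regardless of what the client claims to be, which is the point: poisoned tokens don't change the request rate, but a rate limiter doesn't care about tokens at all.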