catapart • today at 2:08 PM • 4 replies

I'm seeing a lot of comments about how we maintain the status quo, but I'm very interested in hearing from anyone who has conceded that there is no way to stop AI scrapers at this point and what that means for how we maintain public information on the internet in the future.

I don't necessarily believe that we won't find some half-successful solution that will allow server hosting to be done as it currently is, but I'm not very sure that I'll want to participate in whatever schemes come about from it, so I'm thinking more about how I can avoid those schemes rather than insisting that they won't exist/work.

The prevailing thought is that if it isn't already the case, it won't be long before an LLM agent is indistinguishable from a human using a browser. They can start a GUI session, open a browser, navigate to your page, take a snapshot at the OS level and work backward from it to your content, or use the browser dev tools or whatever to scrape your page that way. And yes, that would be much slower and less efficient than what they currently do, but they would only need to do it for the hosts that stay on the bleeding edge of security against AI. Everyone else is in a security race against highly-paid interests. So the idea of having something on the public internet that you can stop people from archiving (for whatever purpose they want) seems like it will soon be an old-fashioned one.

So, taking it as a given that you can't stop what these people are currently trying to stop (without a legislative solution and an enforcement mechanism): how can we make scraping less of a burden on individual hosts? Is this going to coalesce into centralized "archiving" authorities that people trust to archive things, and that give LLMs a much more structured and friendly way to scrape? Or is it more likely someone will come up with a way to punish LLMs or their hosts for "bad" behavior? Or am I completely off base? Is anyone actually discussing this? And, if so, what's on the table?

Replies

ronsor • today at 4:24 PM

> without a legislative solution and an enforcement mechanism

If there's one thing people, especially HN users, should've learned by now, it's that there's no enforcement mechanism worth a damn for Internet legislation when incentives don't align.

heavyset_go • today at 3:33 PM

If you don't publish content to the public web anymore, you don't have to worry about traffic or scraping or bots.

Maybe it'll just be cheaper for CDNs or whatever to sell the data they serve directly instead of doing extra steps with scraping.

suzzer99 • today at 4:18 PM

I don't see this as a permanent problem. Right now there must be 1000s of well-funded AI companies trying to scrape the entire internet. Eventually the AI equity bubble will pop and there will be consolidation. If every player left has already scanned the web, will they need to keep constantly scanning it? Seems like no. Even if they do, there will be a lot fewer of them.

titzer • today at 2:12 PM

You're going to hate this, but one answer might be blockchain. A cryptographically strong, attestable public record of appending information to a shared repository. Combined with cryptographic signatures for humans, it's basically a secure, open git repository for human knowledge.
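
To make the "attestable record of appending" part concrete, here is a minimal sketch of a hash-linked, append-only log. It is only an illustration under assumptions: the names `append`, `verify`, `entry_hash`, and `GENESIS` are made up for this example, there is no consensus or networking, and the per-human signatures are left out entirely (the `author` field is just an unauthenticated label).

```python
import hashlib
import json

# Placeholder "previous hash" for the first entry (an assumption of this sketch).
GENESIS = "0" * 64


def entry_hash(prev_hash: str, author: str, content: str) -> str:
    """Hash an entry together with the hash of the entry before it."""
    payload = json.dumps(
        {"prev": prev_hash, "author": author, "content": content},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(log: list, author: str, content: str) -> dict:
    """Append a new entry that commits to the current tail of the log."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {
        "prev": prev_hash,
        "author": author,  # unauthenticated label; a real system would sign entries
        "content": content,
        "hash": entry_hash(prev_hash, author, content),
    }
    log.append(entry)
    return entry


def verify(log: list) -> bool:
    """Recompute every link; any edit to an earlier entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        expected = entry_hash(entry["prev"], entry["author"], entry["content"])
        if entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True
```

Because each entry's hash covers the previous entry's hash, rewriting an old entry means recomputing every later one, which is what makes the history tamper-evident once other people hold copies of the latest hash.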
