Hacker News

LWN is currently under the heaviest scraper attack seen yet

144 points by luu yesterday at 8:37 PM | 90 comments

Comments

fancyfredbot yesterday at 9:26 PM

Who are these aggressive scrapers run by?

It is difficult to figure out the incentives here. Why would anyone want to pull data from LWN (or any other site) at a rate which would cause a DDOS-like attack?

If I run a big, data-hungry AI lab consuming training data at 100Gb/s, it's much, much easier to scrape 10,000 sites at 10Mb/s each than to DDOS a smaller number of sites with more traffic. Of course the big labs want this data, but why would they risk the reputational damage of overloading popular sites in order to pull it in an hour instead of a day or two?
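For what it's worth, the figures above are self-consistent. A back-of-the-envelope sketch, assuming decimal units (1 Gb = 1000 Mb) and taking "a day" to mean 24 hours; the variable names are made up for illustration and do not describe any real crawler:

```javascript
// Figures taken from the comment above.
const totalGbps = 100;   // aggregate ingest rate of the hypothetical lab
const perSiteMbps = 10;  // "polite" per-site rate

// How many sites must be crawled in parallel to saturate the pipe?
const sitesNeeded = (totalGbps * 1000) / perSiteMbps;

// How much data does one site yield in 24 hours at the polite rate?
const secondsPerDay = 24 * 60 * 60;
const gbPerDay = ((perSiteMbps / 8) * secondsPerDay) / 1000; // megabits -> megabytes -> GB

// Per-site rate needed to pull that same amount in one hour instead.
const mbpsForOneHour = perSiteMbps * 24;

console.log(sitesNeeded);    // 10000
console.log(gbPerDay);       // 108
console.log(mbpsForOneHour); // 240
```

So under these assumptions, pulling a day's worth of data in an hour means hitting a single site 24 times harder, for no gain in total throughput.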

iamnothere yesterday at 9:37 PM

I am starting to think these are not just AI scrapers blindly seeking out data. All kinds of FOSS sites, including low-volume forums and blogs, have been under this kind of persistent pressure for a while now. Given the cost involved in maintaining this kind of widespread, constant scraping, the economics don’t seem to line up. Surely even big-budget projects would adjust their scraping rates based on how many changes they see on a given site. At scale this could save a lot of money and would reduce the chance of being blocked.
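To make the "adjust the rate based on observed changes" idea concrete, here is a minimal sketch of one way a crawler could do it: back off when a page is unchanged, revisit sooner when it changed. The function name, the halve/double rule, and the one-hour and thirty-day bounds are all assumptions for illustration, not how any particular scraper works:

```javascript
// Hypothetical helper: choose the next revisit interval (in hours) for one URL.
// Unchanged page -> wait twice as long; changed page -> come back twice as soon.
function nextIntervalHours(currentHours, changed, { minHours = 1, maxHours = 24 * 30 } = {}) {
  const proposed = changed ? currentHours / 2 : currentHours * 2;
  // Clamp so one noisy observation cannot push the interval to an extreme.
  return Math.min(maxHours, Math.max(minHours, proposed));
}

// Example: a page that stays the same for three visits, then changes once.
let interval = 24;
for (const changed of [false, false, false, true]) {
  interval = nextIntervalHours(interval, changed);
}
console.log(interval); // 96  (24 -> 48 -> 96 -> 192 -> 96)
```

Even a crude rule like this would cut requests to rarely updated archive pages sharply, which is why constant full-rate scraping looks economically odd.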

I haven’t heard of the same attacks facing (for instance) niche hobby communities. Does anyone know if those sites are facing the same scale of attacks?

Is there any chance that this is a deniable attack intended to disrupt the tech industry, or even the FOSS community in particular, with training data gathered as a side benefit? I’m just struggling to understand how the economics can work here.

jacquesm yesterday at 9:05 PM

AI allows companies to resell open source code as if they wrote it themselves, doing an end run around all license terms. This is a major problem.

Of course they're not going to stop at just code. They need all the rest of it as well.

gulugawa yesterday at 8:55 PM

I've had luck blocking scrapers by overwriting JavaScript methods:

"a.getElementsByTagName = function (...args) { /* clear page content */ };"

One can also hide components inside Shadow DOM to make it harder to scrape.

However, these methods will interfere with automated testing tools such as Playwright and Selenium. Also, search engine indexing is likely to be affected.
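A slightly fuller sketch of the method-overwriting trick, under some loud assumptions: `installTrap` and `fakeDocument` are hypothetical names invented here, and the stand-in object exists only so the snippet runs outside a browser. In a real page the target would presumably be `document`, and whether a given scraper ever calls the trapped method is not guaranteed.

```javascript
// Hypothetical helper: replace one method on `target` so that any call
// first runs `onTrip` (e.g. wipe the page) and then returns an empty result.
function installTrap(target, methodName, onTrip) {
  const original = target[methodName];
  target[methodName] = function (...args) {
    onTrip(methodName, args);
    return [];
  };
  // Hand back an undo function so trusted code or tests can restore the method.
  return () => { target[methodName] = original; };
}

// Stand-in for `document` (an assumption, so this runs under Node).
const fakeDocument = {
  body: { textContent: "article text" },
  getElementsByTagName(tag) { return [`<${tag}>`]; },
};

const restore = installTrap(fakeDocument, "getElementsByTagName", () => {
  fakeDocument.body.textContent = ""; // "clear page content"
});

const result = fakeDocument.getElementsByTagName("p");
console.log(result.length, JSON.stringify(fakeDocument.body.textContent)); // 0 ""

restore();
console.log(fakeDocument.getElementsByTagName("p")); // [ '<p>' ]
```

Returning an undo function is one way to soften the testing problem mentioned above: a test harness could restore the original method before it starts driving the page.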

tedivm yesterday at 9:11 PM

I solved this problem for my blog by simply not being interesting.

blakesterz yesterday at 9:06 PM

"It is a DDOS attack involving tens of thousands of addresses"

It is amazing just how distributed some of these things are. Even on the small sites that I help host we see these types of attacks from very large numbers of diverse IPs. I'd love to know how these are being run.

sgc yesterday at 11:47 PM

Can somebody tell me what a normal "cost of doing business" level of bot traffic is these days? I have way too much bot traffic like everybody else, but I don't know if I am an outlier or just run-of-the-mill. I get about 100k bot hits a day, presumably because I have about 350k pages on my site.

Havoc yesterday at 10:46 PM

That makes no sense.

There is no reason for AI scrapers to use tens of thousands of IPs to scrape one site over and over.

That just sounds like a classic DDOS.

zahlman yesterday at 8:49 PM

Is it still ongoing? The thread appears to be over 24 hours old and as a quick test I had no issue loading the main page (which is as snappy and responsive as expected from a low-bandwidth site like LWN).

blibble yesterday at 8:51 PM

the perverse incentive is if you ddos the website such that it shuts down, no other "AI" parasites can get the valuable data

big tech incentivised to ddos... what a world they've built

bloppe yesterday at 9:27 PM

I'm curious how they concluded this was done to scrape for AI training. If the traffic was easily distinguishable from regular users, they would be able to firewall it. If it was not, then how can they be sure it wasn't just a regular old malicious DDOS? Happens way more often than you might think. Sometimes a poorly-managed botnet can even misfire.

2OEH8eoCRo0 yesterday at 10:09 PM

When are we going to start suing these assholes? Why isn't anybody leveraging the legal system? You're all searching for technical solutions to a legal problem and fighting with one hand behind your back.

chrisjj yesterday at 8:57 PM

So which is it? DDOS attack or "AI" scrapers?
