I recently started maintaining a MediaWiki instance for a niche hobbyist community, and we'd been struggling with poor server performance. I didn't set the server up, so I came into it assuming that the tiny amount of RAM the previous maintainer had given it was the problem.
Turns out all of the major AI slop companies had been hounding our wiki constantly for months, and this had resulted in Apache spawning hundreds of instances, bringing the whole machine to a halt.
Millions upon millions of requests, hundreds of GBs of bandwidth. Thankfully we're using Cloudflare, so we could block all of them except real search engine crawlers, and now we don't have any problems at all. I also made sure to constrain Apache's limits a bit.
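For anyone in the same boat, "constraining Apache's limits" means something like the following. This is only a sketch for the prefork MPM, and the numbers are placeholders you'd want to tune against your own RAM per worker, not copy verbatim:

```apache
# Debian-style path: /etc/apache2/mods-available/mpm_prefork.conf
# (adjust the path and numbers for your distro and hardware)
<IfModule mpm_prefork_module>
    # Cap the total number of worker processes so a crawler flood
    # can't spawn hundreds of them and exhaust memory.
    ServerLimit            50
    MaxRequestWorkers      50
    # Recycle each worker after a while to keep per-process
    # memory growth in check.
    MaxConnectionsPerChild 1000
</IfModule>
```

With a cap like this the crawlers just queue up and get slow responses instead of taking the whole box down.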
From what I've read, forums, wikis, and git repos are the primary targets of harassment by these companies for some reason. The worst part is these bots could just download a git repo or a wiki dump and do whatever they want with it, but instead they are designed to push maximum load onto their victims.
Our wiki, in total, is a few gigabytes. They crawled it thousands of times over.
> this had resulted in Apache spawning hundreds of instances, bringing the whole machine to a halt.
Ugh, such a weird design. At least in my experience you're better off setting Apache to always run the same number of instances, and tuning that number as appropriate, rather than letting the instance count fluctuate under load.
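Concretely, with the prefork MPM you can approximate a fixed pool by pinning the start count, spare counts, and worker cap to the same number. A sketch, with a placeholder value you'd tune to your box:

```apache
<IfModule mpm_prefork_module>
    # Pin the process count: start with 20 workers and keep the
    # idle-spare bounds and hard cap at the same value, so load
    # spikes can't trigger a fork storm and quiet periods don't
    # reap the pool.
    StartServers        20
    MinSpareServers     20
    MaxSpareServers     20
    ServerLimit         20
    MaxRequestWorkers   20
</IfModule>
```

The upside is predictable memory use; the downside is you pay for all 20 workers even when idle, which is usually fine on a dedicated box.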
Forum/wiki content is probably more likely to be old enough to be from pre-AI days, meaning they get to avoid the AI inbreeding problem.
Git content is likely to have code for the bots to train on.
For some reason it seems really important to these AI companies to get the very latest version of your pages as well, so they'll do anything in their power to avoid hitting any caching you may try to set up.