Hacker News

taikahessu · last Thursday at 3:13 PM · 4 replies

We had our non-profit website's bandwidth drained and the site temporarily shut down (!!) by our hosting provider, because an Amazon bot was aggressively crawling URLs like ?page=21454 ... etc.

Thankfully Siteground restored our site without any repercussions, as it was not our fault. We added the Amazon bot to robots.txt after that one.
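In case it's useful to anyone else: Amazon documents "Amazonbot" as the user-agent token for its crawler, so a minimal robots.txt entry along these lines should cover it (assuming the bot keeps honoring robots.txt):

    # Block Amazon's crawler site-wide ("Amazonbot" is the token
    # Amazon documents for its crawler)
    User-agent: Amazonbot
    Disallow: /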

I don't like how things are right now. Is a tarpit the solution? Or better laws? Would they stop the Chinese bots? Should they even? I don't know.


Replies

jsheard · last Thursday at 3:15 PM

For the "good" bots, which at least respect robots.txt, you can use this list to get ahead of them before they pummel your site.

https://github.com/ai-robots-txt/ai.robots.txt
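If you'd rather enforce that list at the server instead of trusting crawlers to read robots.txt, a rough nginx sketch could look like this (the user-agent names are just a small sample from the repo's list, and example.org is a placeholder):

    # The map block belongs at the http level: flag requests whose
    # User-Agent matches known AI crawlers (small sample shown).
    map $http_user_agent $ai_bot {
        default         0;
        ~*GPTBot        1;
        ~*ClaudeBot     1;
        ~*Amazonbot     1;
        ~*Bytespider    1;
    }

    server {
        listen 80;
        server_name example.org;    # placeholder

        if ($ai_bot) {
            return 403;             # refuse flagged crawlers outright
        }
    }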

There's no easy solution for the bad bots which ignore robots.txt and spoof their UA, though.

bee_rider · yesterday at 5:22 PM

It's too bad we don't already have a convention for this on the internet:

User/crawler: I'd like this site

Server: ok that’ll be $.02 for me to generate it and you’ll have to pay $.01 in bandwidth costs, plus whatever your provider charges you

User: What? Obviously as a human I don’t consume websites so fast that $.03 will matter to me, sure, add it to my cable bill.

Crawler: Oh no, I'm out of money. (Business model collapse.)
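Funny thing is, HTTP has had status code 402 Payment Required reserved for more or less this idea since the 1990s; the payment flow just never got standardized. A hypothetical exchange (the X-Price header here is invented, no standard for it exists):

    GET /?page=21454 HTTP/1.1
    Host: example.org

    HTTP/1.1 402 Payment Required
    X-Price: 0.03 USD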

mrweasel · yesterday at 8:48 AM

> We had our non-profit website's bandwidth drained

There are a number of sites having issues with scrapers (AI and otherwise) generating so much traffic that transit providers are telling them their fees will go up at the next contract renewal if the traffic is not reduced. It's very hard for individual sites to do much about it, as most of the traffic comes from AWS, GCP, or Azure IP ranges.
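For what it's worth, the big clouds do publish their IP ranges, so a site can at least build a blocklist mechanically; here's a Python sketch against AWS's documented feed (GCP and Azure publish similar feeds in their own formats). It blocks all of AWS, not just crawlers, so it's a blunt instrument:

    # Sketch: turn AWS's published IP ranges into nginx "deny" rules.
    # Uses the feed AWS documents at ip-ranges.amazonaws.com.
    import json
    import urllib.request

    AWS_RANGES = "https://ip-ranges.amazonaws.com/ip-ranges.json"

    with urllib.request.urlopen(AWS_RANGES) as resp:
        data = json.load(resp)

    # De-duplicate: the same prefix can appear under several services.
    for prefix in sorted({p["ip_prefix"] for p in data["prefixes"]}):
        print(f"deny {prefix};")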

It is a problem and the AI companies do not care.

nosioptar · yesterday at 10:28 PM

I want better laws. The bot operator should have to pay you damages for taking down your site.

If acting like inconsiderate tools starts costing them money, they may stop.