> Perhaps someone at their end screwed up a loop conditional, but you'd think some monitoring dashboard somewhere would have a warning pop up because of this.
If you've been in any big company you'll know things perpetually run in a degraded, somewhat broken mode. They've even made up the term "error budget" because they can't be bothered to fix the broken shit, so now there's an acceptable level of brokenness.
Is there any downside to just blocking the whole Meta IP range? I mean, they aren't even running a search engine AFAIK. Why would I want them to crawl my website?
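A minimal sketch of what that block could look like in nginx, assuming you pull Meta's current prefixes from their ASN (AS32934) yourself; the ranges below are examples, not an authoritative list:

```nginx
# Example ranges announced by Meta (AS32934); verify the current
# list before deploying, since announced prefixes change over time.
deny 31.13.24.0/21;
deny 66.220.144.0/20;
deny 69.63.176.0/20;
deny 157.240.0.0/16;
```

Blocking at the firewall or CDN instead is cheaper still, since the connection never reaches the web server at all.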
Facebook just decided that instead of loading the robots.txt for every host they intend to crawl, they'll just ignore all the other robots.txt files and then access this one a million times to restore the average.
For some reason, Facebook has been requesting my Forgejo instance's robots.txt in a loop for the past few days, currently at a speed of 7700 requests per hour. The resource usage is negligible, but I'm wondering why it's happening in the first place and how many other robot files they're also requesting repeatedly. Perhaps someone at Meta broke a loop condition.
Maybe they're trying to DDoS it so that, once an error is returned, they can assume no robots.txt file exists and then crawl everything else on the site?
Has anyone done research on blocking these bots by claiming to host illegal material or mentioning certain topics? I mean having a few entries in your robots.txt like "/kill-president", "/illegal-music-downloads", "/casino-lucky-tiger-777", etc.
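For what it's worth, a honeypot variant of that idea is straightforward: list the paths as Disallow entries, then flag any client that fetches them anyway (the paths below are just the hypothetical ones from the comment):

```
User-agent: *
Disallow: /kill-president
Disallow: /illegal-music-downloads
Disallow: /casino-lucky-tiger-777
```

A well-behaved crawler never requests a disallowed URL it learned from robots.txt, so any hit on those paths is a bot worth banning outright.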
Facebook is honestly the least interesting crawler misbehaving right now. The real shift is GPTBot, ClaudeBot, PerplexityBot and a dozen other AI crawlers that don't even identify themselves half the time.
I've been monitoring server logs across ~150 sites and the pattern is striking: AI crawler traffic increased roughly 8x in the last 12 months, but most site owners have no idea because it doesn't show up in analytics. The bots read everything, respect robots.txt maybe 60% of the time, and the content they index directly shapes what ChatGPT or Perplexity recommends to users.
The irony is that robots.txt was designed for a world where crawling meant indexing for search results. Now crawling means training data and real-time retrieval for AI answers. Completely different power dynamic and most robots.txt files haven't adapted.
My bet is this is a threading bug rather than just a broken loop. Somehow the threads are failing to communicate with each other, or there's some sort of race condition, so the same task keeps getting put into the queue but the result goes missing. Something like that.
Do crawlers follow/cache 301 permanent redirects? I wonder if you could point the firehose back at Facebook, but it would mean they wouldn't get your robots.txt anymore (though I'd just blackhole that whole subnet anyway).
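A sketch of that redirect in nginx, matching on Facebook's documented crawler user-agent strings (facebookexternalhit, meta-externalagent); whether their crawler actually follows or caches the 301 is exactly the open question:

```nginx
# Hypothetical: bounce Facebook's own crawler back at facebook.com.
if ($http_user_agent ~* "facebookexternalhit|meta-externalagent") {
    return 301 https://www.facebook.com$request_uri;
}
```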
>my extreme LibreOffice Calc skillz
How does one learn these skills? I can see them being useful in the future.
I recently started maintaining a MediaWiki instance for a niche hobbyist community and we'd been struggling with poor server performance. I didn't set the server up, so came into it assuming that the tiny amount of RAM the previous maintainer had given it was the problem.
Turns out all of the major AI slop companies had been hounding our wiki constantly for months, and this had resulted in Apache spawning hundreds of instances, bringing the whole machine to a halt.
Millions upon millions of requests, hundreds of GBs of bandwidth. Thankfully we're using Cloudflare, so we could block all of them except real search engine crawlers, and now we don't have any problems at all. I also made sure to constrain Apache's limits a bit.
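For anyone hitting the same wall, the Apache side of that fix is mostly capping the worker pool so a crawler burst queues instead of forking the machine into the ground; a sketch for the prefork MPM, with illustrative numbers you'd size to your own RAM:

```apache
<IfModule mpm_prefork_module>
    # Hard cap on concurrent worker processes.
    MaxRequestWorkers      64
    ServerLimit            64
    # Recycle workers periodically to bound per-process memory growth.
    MaxConnectionsPerChild 1000
</IfModule>
```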
From what I've read, forums, wikis, and git repos are the primary targets of harassment by these companies for some reason. The worst part is these bots could just clone a git repo or download a wiki dump and do whatever they want with it, but instead they are designed to push maximum load onto their victims.
Our wiki, in total, is a few gigabytes. They crawled it thousands of times over.
1. Put a note in robots.txt that says
"By accessing this file more than once per second you agree to pay a fee of $0.10 per access, plus an additional $0.10 for each previous access, each day. This fee will be charged on a per-access basis."
2. Run a program that logs the number of Facebook requests and prints a summary and a bill.
3. Then get a stamp and an envelope, write out a bill for the first day, call it a demand for payment, and send it to:
Facebook, Inc. Attn: Security Department/Custodian of Records 1601 S. California Avenue Palo Alto, CA 94304 U.S.A.
You can optionally send this registered mail, where someone has to sign for it.
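Step 2 above could be a few lines of Python over your access log; a sketch assuming combined-log-format lines and the $0.10-per-access fee from the notice (the log lines here are made up, and `facebookexternalhit` is Facebook's documented crawler user agent):

```python
import re
from datetime import date

FEE_PER_ACCESS = 0.10  # dollars, per the robots.txt notice

def bill_from_log(lines, pattern=r"facebookexternalhit"):
    """Count log lines whose user agent matches `pattern`; return (hits, fee)."""
    hits = sum(1 for line in lines if re.search(pattern, line))
    return hits, hits * FEE_PER_ACCESS

# Hypothetical sample log lines:
log = [
    '31.13.27.1 - - [..] "GET /robots.txt HTTP/1.1" 200 24 "-" "facebookexternalhit/1.1"',
    '10.0.0.5 - - [..] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
hits, fee = bill_from_log(log)
print(f"{date.today()}: {hits} Facebook requests, amount due ${fee:.2f}")
```

In real use you'd read the lines from your server's access log instead of a hardcoded list, and run it once a day from cron.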
Corporations such as Facebook are used to getting their way in court because they can afford lawyers and you cannot. So they have gotten lazy and do not worry about what is fair or legal.
So take them to court when you have a legitimate legal issue. The courts are there to provide redress when you are aggrieved, right? Use them. You can file a small-claims action easily. Just make sure you have 1) a legitimate case, 2) evidence, and 3) proof that you sent them a demand for payment.