Anubis should be something that doesn't inconvenience all the real humans that visit your site.
I work with ffmpeg, so I have to access their bug tracker and mailing list site sometimes. Every few days I'm hit with the Anubis block, and somewhere between a fifth and a third of the time it fails completely. The other times it delays me by a few seconds. Over time this has turned me sour on the Anubis project, which I initially supported.
> Unfortunately, the price LLM companies would have to pay to scrape every single Anubis deployment out there is approximately $0.00.
The math on the site linked here as a source for this claim is incorrect. The author of that site assumes that scrapers will keep track of the access tokens for a week, but most internet-wide scrapers don't do so. The whole purpose of Anubis is to be expensive for bots that repeatedly request the same site multiple times a second.
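For a sense of scale, here is a back-of-the-envelope version of that argument. Every number below is an assumption for illustration (CPU price, time per challenge, number of deployments, request rate), not a measurement of Anubis or of any real scraper:

```python
# All figures below are illustrative assumptions, not measurements.
cpu_price_per_core_hour = 0.05     # assumed cloud price, USD
seconds_per_challenge = 1.0        # assumed time to solve one PoW challenge
deployments = 100_000              # assumed number of Anubis-protected sites

cost_per_challenge = cpu_price_per_core_hour / 3600 * seconds_per_challenge

# If the scraper caches each access token for a week, it solves one
# challenge per site per week:
with_caching = deployments * cost_per_challenge

# If it throws tokens away and hits each site once per second, it solves
# a fresh challenge for every single request:
requests_per_site_per_week = 7 * 24 * 3600
without_caching = deployments * requests_per_site_per_week * cost_per_challenge

print(f"token caching:    ${with_caching:,.2f}/week")      # roughly $1.39
print(f"no token caching: ${without_caching:,.2f}/week")   # roughly $840,000
```

Whether scrapers actually cache tokens is exactly what the two sides of this argument disagree about.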
I don't like this solution because it is hostile to those who use solutions such as UMatrix / NoScript in their browser, who use TUI browsers (e.g. chawan, lynx, w3m, ...) or who have disabled Javascript outright.
Admittedly, this is no different than the kinds of ways Anubis is hostile to those same users, truly a tragedy of the commons.
The internet in its current form, where I can theoretically ping any web server on Earth from my bedroom, doesn't seem sustainable. I think it will have to end at some point.
I can't fully articulate it, but I feel like there is some game-theoretic aspect of the current design that is just not compatible with reality.
Just a personal note: when I want to see a page and instead have to sit through a three-second nag screen like Anubis's, I get annoyed and pushed even further toward bypassing the website whenever possible and getting the info I want directly from an LLM or a search engine.
It's kind of a self-fulfilling prophecy: you make the visitor experience worse, and that becomes its own justification for why getting the content from an LLM is wanted and needed.
All of that because, in the current lambda/cloud-computing world, it has become very expensive to process even a handful of requests.
The Caddy config in the parent article uses status code 418. This is cute, but wouldn't it break search engine indexing? Why not use a 307 instead?
"Unfortunately, Cloudflare is pretty much the only reliable way to protect against bots."
With footnote:
"I don’t know if they have any good competition, but “Cloudflare” here refers to all similar bot protection services."
That's the crux. Cloudflare is the default; nobody seems willing to take the risk on a competitor for some reason. Competitors apparently exist, but when asked, people can't even name them.
(For what it's worth I've been using AWS Cloudfront but I had to think a moment to remember its name.)
The problem is that increasingly, they are running JS.
In the ongoing arms race, we're likely to see simple checks like this met by scraper-side detection that looks for "set a cookie", or at least opens the page in headless Chrome and measures the cookies.
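For reference, the check being discussed is roughly this. A minimal Flask sketch of the idea, not the article's actual Caddy config; the cookie name, value, and status code are made up for illustration:

```python
from flask import Flask, make_response, request

app = Flask(__name__)

# Tiny page that sets a cookie via JavaScript and reloads. Anything that
# doesn't execute JS (or doesn't keep cookies) never gets past it.
CHALLENGE_PAGE = """<!doctype html>
<script>
  document.cookie = "human=1; path=/; max-age=604800";
  location.reload();
</script>"""

@app.route("/")
def index():
    if request.cookies.get("human") == "1":
        return "the real content"
    # No cookie yet: serve the tiny JS shim instead of the real page.
    resp = make_response(CHALLENGE_PAGE, 418)
    resp.headers["Content-Type"] = "text/html"
    return resp
```

A scraper that opens the page in headless Chrome, as described above, sails straight through this; that's the arms-race point.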
There are reasons to choose the slightly annoying solution on purpose, though. I'm thinking of a political statement along the lines of "We have a problem with asshole AI companies, and here's how they make everyone's life slightly worse."
So I don't use Cloudflare. I only serve clients that support Brotli and have a valid cookie, and all the actual content comes down an SSE connection. I haven't had any problems with bots on my $5 VPS.
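A minimal sketch of that kind of gate, assuming Flask; the cookie name, the endpoint, and the exact checks are illustrative, not the commenter's actual setup:

```python
from flask import Flask, Response, abort, request

app = Flask(__name__)

def looks_like_a_browser() -> bool:
    # Require Brotli support advertised in Accept-Encoding plus a
    # previously issued cookie; most naive scrapers fail one or both.
    supports_brotli = "br" in request.headers.get("Accept-Encoding", "")
    has_cookie = request.cookies.get("session") is not None
    return supports_brotli and has_cookie

@app.route("/stream")
def stream():
    if not looks_like_a_browser():
        abort(403)

    def events():
        # The actual content only ever arrives over Server-Sent Events.
        yield "data: hello\n\n"

    return Response(events(), mimetype="text/event-stream")
```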
What I realised recently is that, for non-user browsers, my demos are effectively zip bombs.
Why?
Because I stream each frame, and each frame is around 180 kB uncompressed (compressed frames can be as small as 13 bytes). This is fine because the user's browser doesn't hold onto the frames.
But a crawler will hold onto those frames, and very quickly that ends up being a very bad time for them.
Of course, there's nothing of value to scrape, so it's mostly pointless. But I found it entertaining that some scummy crawler is getting nuked by checkboxes [1].
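Rough arithmetic on why that hurts a crawler that buffers the whole response (the frame size is from the comment above; the frame rate and how long the crawler hangs on are assumptions):

```python
uncompressed_frame = 180 * 1024   # ~180 kB per frame, per the comment
frames_per_second = 30            # assumed
crawl_seconds = 10 * 60           # assume the crawler keeps the connection open 10 min

# A browser discards old frames; a crawler that keeps the full response
# body ends up holding all of them in memory.
buffered_bytes = uncompressed_frame * frames_per_second * crawl_seconds
print(f"~{buffered_bytes / 1e9:.1f} GB buffered")   # ~3.3 GB and climbing
```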
Exactly. I don't understand what computation you can afford to do in 10 seconds on a small number of cores that bots running in large data centers cannot.
Here are some benchmarks. TL;DR: Anubis is not as performant as an optimized client prover running on the same HEDT CPU.
So the "PoW tax" essentially only applies to low-volume requesters, who either have no incentive to optimize or whose bespoke setups are too diverse to optimize at scale.
https://yumechi.jp/en/blog/2025/proof-of-mutex-outspeeding-a...
https://github.com/eternal-flame-AD/pow-buster
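For readers who haven't looked at what the challenge actually is: it's a brute-force search for a nonce whose SHA-256 hash meets a difficulty target. A minimal solver sketch is below (the exact challenge format and difficulty here are assumptions, not Anubis's wire format); the pow-buster linked above does the same search with an optimized native prover, which is why the benchmark gap is so large.

```python
import hashlib

def solve(challenge: str, difficulty_bits: int) -> int:
    """Find a nonce such that sha256(challenge + nonce) has
    `difficulty_bits` leading zero bits."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

# With 16 bits of difficulty this takes ~65k hashes on average: milliseconds
# in native code, noticeably longer in browser JavaScript on a slow device.
print(solve("example-challenge", 16))
```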
The problem was "fixed" but then reverted because the fix had a deadlock bug. (Changelog entry: "Remove bbolt actorify implementation due to causing production issues.")
Big picture, why does everyone scrape the web?
Why doesn't one company do it and then resell the data? Is it a legal/liability issue? If you scrape, it's a legal grey area, but if you sell what you scrape, it's clearly copyright infringement?
I was briefly messing around with Pangolin, which is supposed to be a self-hosted Cloudflare Tunnels sort of thing. Pretty cool.
One thing I noticed, though, was that the Digital Ocean Marketplace image asks if you want to install something called Crowdsec, described as a "multiplayer firewall". It is a paid service, but there appears to be a community offering that is reasonably well liked. I was really wondering what downsides it has (besides the obvious one, which is that you are definitely trading some user privacy in service of security), but at least in principle the idea seems like a nice middle ground between Cloudflare and nothing, if it works and the business model holds up.
> But it still works, right? People use Anubis because it actually stops LLM bots from scraping their site, so it must work, right?
> Yeah, but only because the LLM bots simply don’t run JavaScript.
I don't think that this is the case, because when Anubis itself switched from a proof-of-work to a different JavaScript-based challenge, my server got overloaded, but switching back to the PoW solution fixed it [0].
I also semi-hate Anubis since it required me to add JS to a website that used none before, but (1) it's the only thing that stopped the bot problem for me, (2) it's really easy to deploy, and (3) very few human visitors are incorrectly blocked by it (unlike Captchas or IP/ASN bans that have really high false-positive rates).
It seems that people do NOT understand it's already game over. Lost. When the abuse was small, nobody cared: oh, just a few bad actors, nothing to worry about, they'll get bored and go away. No, they won't; they grow and grow, and now even most of the good guys have turned bad, because there is no punishment for it. So, as I said, game over.
It's time to start building our own walled gardens: overlay VPN networks for humans. Put services there. Someone misbehaves? BAN their IP. They come back? BAN them again. Back again? Fine, BAN the whole VPN provider. Just clean up the mess. Different networks can peer and exchange traffic; look, the Internet is just a network of networks, it's not that hard.
This came up before (and this post links to the Tavis Ormandy post that kicked up the last firestorm about Anubis) and without myself shading the intent or the execution on Anubis, just from a CS perspective, I want to say again that the PoW thing Anubis uses doesn't make sense.
Work functions make sense in password hashes because they exploit an asymmetry: attackers will guess millions of invalid passwords for every validated guess, so the attacker bears most (really almost all) of the cost.
Work functions make sense in antispam systems for the same reason: spam "attacks" rely on the cost of an attempt being so low that it's efficient to target millions of victims in the expectation of just one hit.
Work functions make sense in Bitcoin because they function as a synchronization mechanism. There's nothing actually valorous about solving a SHA2 puzzle, but the puzzles give the whole protocol a clock.
Work functions don't make sense as a token tax; there's actually the opposite of the antispam asymmetry there. Every bot request to a web page yields tokens to the AI company. Legitimate users, who far outnumber the bots, are actually paying more of a cost.
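Toy numbers make the reversal concrete (all of these are assumptions for illustration, not measurements):

```python
work_seconds = 5                # assumed CPU cost of one challenge
human_visits_per_page = 1000    # assumed: humans far outnumber scraper fetches
scraper_fetches_per_page = 1    # the scraper only needs the page once

human_cpu = human_visits_per_page * work_seconds        # 5000 CPU-seconds
scraper_cpu = scraper_fetches_per_page * work_seconds   #    5 CPU-seconds

# The population you're trying to protect ends up paying ~99.9% of the tax.
print(human_cpu, scraper_cpu)
```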
None of this is to say that a serious anti-scraping firewall can't be built! I'm fond of pointing to how Youtube addressed this very similar problem, with a content protection system built in Javascript that was deliberately expensive to reverse engineer and which could surreptitiously probe the precise browser configuration a request to create a new Youtube account was using.
The next thing Anubis builds should be that, and when they do that, they should chuck the proof of work thing.
"Yes, it works, and does so as effectively as Anubis, while not bothering your visitors with a 10-second page load time."
Cool... but I guess now we need a benchmark for such solutions. I don't know the author, and I roughly know the problem (I self-host, and most of my traffic now comes from AI scraper bots, not the usual indexing bots or, mind you, humans), but when there are numerous solutions to a multi-dimensional problem, I need a common way to compare them.
Yet another solution is always welcome, but without a way to compare them efficiently, it doesn't help me pick the right one for my case.
What's the endgame of this increasing arms race? A gated web where you need to log in everywhere? Even more captchas and Cloudflare becoming the gateway to the internet? There must be a better way.
We're somehow still stuck with CAPTCHAs (and other challenges), a 25-year-old concept that wastes millions of human hours and billions in infra costs [0].
How else would I inter my dead and make sure they get to the afterlife?
Anubis's design is copied from a great botnet-protection mechanism: you serve the JavaScript cheaply from memory, and then the client is forced to do expensive compute in order to use your expensive compute. This works great at keeping attackers from wasting your time; it turns a 1:1000 amplification in compute costs into 1000:1.
It is a shitty and obviously bad solution for preventing scraping traffic. The goal of scraping traffic isn't to overwhelm your site; it's to read it once. If you make it prohibitively expensive to read your site even once, nobody comes to it. If you make it only mildly expensive, nobody scraping cares.
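A toy version of that amplification flip, with made-up numbers:

```python
serve_challenge_ms = 1      # assumed cost to serve the cached challenge page
render_real_page_ms = 1000  # assumed cost of rendering the expensive dynamic page
client_pow_ms = 1000        # assumed client-side proof-of-work time

# Without the challenge: the attacker spends ~nothing and the server spends
# render_real_page_ms per request -- roughly 1:1000 against the server.
# With it: the attacker burns client_pow_ms before the server spends more
# than serve_challenge_ms -- roughly 1000:1 the other way.
print(render_real_page_ms // serve_challenge_ms, client_pow_ms // serve_challenge_ms)
```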
Anubis is specifically DDOS protection, not generally anti-bot, aside from defeating basic bots that don't emulate a full browser. It's been cargo-culted in front of a bunch of websites because of the latter, but it was obviously not going to work for long.
My favourite thing about Anubis is that (in its default configuration) it skips the actual challenge altogether if the User-Agent header is set to curl's.
E.g. if you open this in browser, you’ll get the challenge: https://code.ffmpeg.org/FFmpeg/FFmpeg/commit/13ce36fef98a3f4...
But if you fetch the same URL with curl (using its default User-Agent), you get the page content straight away.
I’m pretty sure this gets abused by AI scrapers a lot. If you’re running Anubis, take a moment to configure it properly, or better put together something that’s less annoying for your visitors like the OP.