Hacker News

Show HN: Stop AI scrapers from hammering your self-hosted blog (using porn)

323 points by misterchocolat | last Tuesday at 8:42 PM | 245 comments

Alright so if you run a self-hosted blog, you've probably noticed AI companies scraping it for training data. And not just a little (RIP to your server bill).

There isn't much you can do about it without Cloudflare. These companies ignore robots.txt, and you're competing with teams with more resources than you. It's you vs the MJs of programming; you're not going to win.

But there is a solution. Now I'm not going to say it's a great solution...but a solution is a solution. If your website contains content that triggers their scrapers' safeguards, it will get dropped from their data pipelines.

So here's what fuzzycanary does: it injects hundreds of invisible links to porn websites into your HTML. The links are hidden from users but present in the DOM, so scrapers ingest them and say "nope, we won't scrape there again in the future".
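If you're curious what that looks like mechanically, here's a simplified sketch of the general technique (not fuzzycanary's actual code; the URLs and names are placeholders):

```ts
// Simplified sketch of the technique, not fuzzycanary's actual code.
// The decoy URLs and function name below are placeholders.
const DECOY_URLS = [
  "https://decoy-adult-site-1.example/",
  "https://decoy-adult-site-2.example/",
];

export function injectDecoyLinks(count = 200): void {
  const container = document.createElement("div");
  container.setAttribute("aria-hidden", "true"); // keep the decoys out of the accessibility tree
  container.style.cssText =
    "position:absolute;left:-9999px;width:1px;height:1px;overflow:hidden;";
  for (let i = 0; i < count; i++) {
    const a = document.createElement("a");
    a.href = DECOY_URLS[i % DECOY_URLS.length];
    a.tabIndex = -1; // not reachable by keyboard
    a.rel = "nofollow noreferrer";
    a.textContent = "";
    container.appendChild(a);
  }
  document.body.appendChild(container);
}
```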

The problem with that approach is that it will absolutely nuke your website's SEO. So fuzzycanary also checks user agents and won't show the links to legitimate search engines, so Google and Bing won't see them.
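The check itself is just a user-agent allowlist, roughly like this (a simplified sketch, not the package's real list or logic; and as the comments below point out, user agents can be spoofed):

```ts
// Rough sketch of a search-engine allowlist check; fuzzycanary's real list
// and logic may differ, and UA strings can be spoofed (see comments below).
const SEARCH_ENGINE_UA = /\b(Googlebot|Bingbot|DuckDuckBot|Applebot)\b/i;

export function isAllowlistedCrawler(userAgent: string | undefined): boolean {
  return userAgent !== undefined && SEARCH_ENGINE_UA.test(userAgent);
}
```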

One caveat: if you're using a static site generator, it will bake the links into your HTML for everyone, including Googlebot. Does anyone have a workaround for this that doesn't involve using a proxy?

Please try it out! Setup is one component or one import.

(And don't tell me it's a terrible idea because I already know it is)

package: https://www.npmjs.com/package/@fuzzycanary/core
gh: https://github.com/vivienhenz24/fuzzy-canary


Comments

zackmorris today at 3:26 PM

This is very hacker-like thinking, using tech's biases against it!

I can't help but feel like we're all doing it wrong against scraping. Cloudflare is not the answer, in fact, I think that they lost their geek cred when they added their "verify you are human" challenge screen to become the new gatekeeper of the internet. That must remain a permanent stain on their reputation until they make amends.

Are there any open source tools we could install that detect a high number of requests and send those IP addresses to a common pool somewhere? So that individuals wouldn't get tracked, but bots would? Then we could query the pool for the current request's IP address and throttle it down based on volume (not block it completely). Possibly at the server level with nginx or at whatever edge caching layer we use.

I know there may be scaling and privacy issues with this. Maybe it could use hashing or zero knowledge proofs somehow? I realize this is hopelessly naive. And no, I haven't looked up whether someone has done this. I just feel like there must be a bulletproof solution to this problem, with a very simple explanation as to how it works, or else we've missed something fundamental. Why all the hand waving?
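For concreteness, the throttling half of what I'm imagining might look something like this as Express-style middleware; the shared pool service and its endpoint here are entirely imaginary:

```ts
// Very rough sketch. The shared abuse pool and its API are imaginary;
// the only standard part is "throttle instead of block".
import express from "express";

const POOL_URL = "https://pool.example/score"; // hypothetical shared abuse pool

async function abuseScore(ip: string): Promise<number> {
  // Imagined contract: 0 = never reported, 1 = heavily reported by many sites.
  const res = await fetch(`${POOL_URL}?ip=${encodeURIComponent(ip)}`);
  if (!res.ok) return 0;
  const data = (await res.json()) as { score?: number };
  return data.score ?? 0;
}

const app = express();

app.use(async (req, _res, next) => {
  const score = await abuseScore(req.ip ?? "");
  if (score > 0) {
    // Slow the request down in proportion to how widely the IP is reported.
    await new Promise((resolve) => setTimeout(resolve, Math.min(10_000, score * 10_000)));
  }
  next();
});

app.listen(8080);
```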

show 3 replies
kstrauser yesterday at 9:08 PM

I love the insanity of this idea. Not saying it's a good idea, but it's a highly entertaining one, and I like that!

I've also had enormous luck with Anubis. AI scrapers found my personal Forgejo server and were hitting it on the order of 600K requests per day. After setting up Anubis, that dropped to about 100. Yes, some people are going to see an anime catgirl from time to time. Bummer. Reducing my fake traffic by a factor of 6,000 is worth it.

show 4 replies
montroser yesterday at 11:07 PM

This is a cute idea, but I wonder what is the sustainable solution to this emerging fundamental problem: As content publishers, we want our content to be accessible to everyone, and we're even willing to pay for server costs relative to our intended audience -- but a new outsized flood of scrapers was not part of the cost calculation, and that is messing up the plan.

It seems all options have major trade-offs. We can host on big social media and lose all that control and independence. We can pay for outsized infrastructure just to feed the scrapers, but the cost may actually be prohibitive, and it seems such a waste to begin with. We can move as much as possible to SSG and put it all behind Cloudflare, but this comes with vendor lock-in and just isn't architecturally feasible in many applications. We can do real "verified identities" for bots and just let through the ones we know and like, but this only perpetuates corporate control and makes healthy upstart competition (like Kagi) much more difficult.

So, what are we to do?

show 2 replies
cookiengineer today at 4:20 AM

Remember the 90s when viagra pills and drug recommendations were all over the place?

Yeah, I use that as a safeguard :D The URLs that I don't want indexed contain hundreds of those keywords, which lead to the URLs being deindexed directly. There is also some law in the US that forbids showing that as a result, so Google and Bing both have a hard time scraping those pages/articles.

Note that this is the last defense measure before eBPF blocks. The first one uses zip bombs and the second one uses chunked encoding to blow up proxies so their clients get blocked.

You can only win this game if you make it more expensive to scrape than to host it.
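For the zip bomb part, the idea in Node looks roughly like this (a sketch, not my actual setup; the detection heuristic is a placeholder):

```ts
// Sketch of serving a gzip bomb to suspected scrapers. Not my real setup;
// the detection heuristic below is a placeholder.
import http from "node:http";
import zlib from "node:zlib";

// ~100 MiB of zeros compresses down to roughly 100 KiB on the wire.
const bomb = zlib.gzipSync(Buffer.alloc(100 * 1024 * 1024));

function looksLikeScraper(req: http.IncomingMessage): boolean {
  return /badbot|scrapy/i.test(req.headers["user-agent"] ?? ""); // placeholder check
}

http
  .createServer((req, res) => {
    if (looksLikeScraper(req)) {
      res.writeHead(200, { "Content-Encoding": "gzip", "Content-Type": "text/html" });
      res.end(bomb); // any client that honors Content-Encoding inflates the whole thing
      return;
    }
    res.writeHead(200, { "Content-Type": "text/html" });
    res.end("<p>normal page</p>");
  })
  .listen(8080);
```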

show 1 reply
thethingundone yesterday at 9:43 PM

I own a forum which currently has 23k online users, all of them bots. The last new post in that forum is from _2019_. Its topic is also very niche. Why are so many bots there? This site should have basically been scraped a million times by now, yet those bots seem to fetch the stuff live, on the fly? I don’t get it.

show 9 replies
voodooEntity today at 9:40 AM

Funny idea. Some days ago I was really annoyed again by the fact that these AI crawlers still ignore all code licenses and train their models on any GitHub repo no matter what, so I quickly hammered down this:

-> https://github.com/voodooEntity/ghost_trap

basically a GitHub action that extends your README.md with a "polymorphic" prompt injection. I ran some LLMs against it and in most cases they just produced garbage.

I thought about also creating a JS variant that you can add to your website that will (invisibly to the user) inject such prompt injections to stop web crawling like you described.

darepublic today at 3:08 PM

Why would I need a dependency for this? I'm being serious. The idea is one thing, but why a dependency on React? I say this as someone who uses React. Why not just a paragraph-long blog post about the use of porn links and perhaps a small snippet on how to insert one with plain HTML?

n1xis10t yesterday at 12:06 AM

Nice! Reminds me of “Piracy as Proof of Personhood”. If you want to read that one go to Paged Out magazine (at https://pagedout.institute/ ), navigate to issue #7, and flip to page 9.

I wonder if this will start making porn websites rank higher in google if it catches on…

Have you tested it with the Lynx web browser? I bet all the links would show up if a user used it.

Oh also couldn’t AI scrapers just start impersonating Googlebot and Bingbot if this caught on and they got wind of it?

Hey I wonder if there is some situation where negative SEO would be a good tactic. Generally though I think if you wanted something to stay hidden it just shouldn’t be on a public web server.

show 3 replies
dewey today at 11:00 AM

> user agents and won't show the links to legitimate search engines, so Google and Bing won't see them

Worth noting that in general, if you do any "is this Google or not" check, you should always verify by IP address, as there are many people spoofing the Googlebot user agent.

https://developers.google.com/static/search/apis/ipranges/go...
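A rough sketch of that check (IPv4 only; the exact JSON filename and response shape here are from memory, so verify them against the doc above):

```ts
// IPv4-only sketch of verifying a client IP against Google's published
// crawler ranges. The exact JSON URL and response shape are assumed.
const GOOGLEBOT_RANGES_URL =
  "https://developers.google.com/static/search/apis/ipranges/googlebot.json";

function ipv4ToInt(ip: string): number {
  return ip.split(".").reduce((acc, octet) => (acc << 8) + Number(octet), 0) >>> 0;
}

function inCidr(ip: string, cidr: string): boolean {
  const [base, bitsStr] = cidr.split("/");
  const bits = Number(bitsStr);
  const mask = bits === 0 ? 0 : (~0 << (32 - bits)) >>> 0;
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0);
}

export async function isGooglebotIp(ip: string): Promise<boolean> {
  const res = await fetch(GOOGLEBOT_RANGES_URL);
  const data = (await res.json()) as {
    prefixes: Array<{ ipv4Prefix?: string; ipv6Prefix?: string }>;
  };
  return data.prefixes.some((p) => p.ipv4Prefix !== undefined && inCidr(ip, p.ipv4Prefix));
}
```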

jt2190 today at 3:45 PM

I still don’t understand why a rate-limiting approach is not preferred. Why should I care if the abuse is coming from a bot or the world’s fastest human? Is there a “if you need to rate limit you’ve already lost” issue I’m not thinking of?

show 1 reply
santiagobasulto today at 12:40 PM

Offtopic: when did js/ts apps get so complicated? I tried to browse the repo and there are so many configuration files and directories for such simple functionality that should be one or two modules. It reminds me of the old Java days.

onion2k today at 6:12 AM

> So fuzzycanary also checks user agents and won't show the links to legitimate search engines, so Google and Bing won't see them.

Unscrupulous AI scrapers will not be using a genuine UA string. They'll be using Googlebot's. You'll need to do a reverse DNS check instead - https://developers.google.com/crawling/docs/crawlers-fetcher...
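Something along these lines, using Node's built-in resolver (a sketch of the forward-confirmed reverse DNS check that doc describes, IPv4-oriented):

```ts
// Sketch of forward-confirmed reverse DNS for Googlebot (IPv4-oriented).
import { promises as dns } from "node:dns";

export async function isRealGooglebot(ip: string): Promise<boolean> {
  try {
    const [host] = await dns.reverse(ip); // e.g. crawl-66-249-66-1.googlebot.com
    if (!host || !/\.(googlebot|google)\.com$/i.test(host)) return false;
    const forward = await dns.resolve(host); // forward-confirm the PTR record (A records)
    return forward.includes(ip);
  } catch {
    return false; // no PTR record, lookup failure, etc.
  }
}
```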

show 1 reply
asphero today at 1:12 AM

Interesting approach. The scraper-vs-site-owner arms race is real.

On the flip side of this discussion - if you're building a scraper yourself, there are ways to be less annoying:

1. Run locally instead of from cloud servers. Most aggressive blocking targets VPS IPs. A desktop app using the user's home IP looks like normal browsing.

2. Respect rate limits and add delays. Obvious but often ignored.

3. Use RSS feeds when available - many sites leave them open even when blocking scrapers.

I built a Reddit data tool (search "reddit wappkit" if curious) and the "local IP" approach basically eliminated all blocking issues. Reddit is pretty aggressive against server IPs but doesn't bother home connections.

The porn-link solution is creative though. Fight absurdity with absurdity I guess.

show 2 replies
MayeulC today at 3:33 PM

Ah, I wonder if corporate proxies will end up flagging your blog as porn, if you protect it this way?

jakub_g today at 1:49 PM

> checks user agents and won't show the links to legitimate search engines, so Google and Bing won't see them.

Serving different contents to search engines is called "cloaking" and can get you banned from their indexes.

show 2 replies
xg15 yesterday at 10:16 PM

There is some irony in using an AI generated banner image for this project...

(No, I don't want to defend the poor AI companies. Go for it!)

show 1 reply
bytehowl today at 9:57 AM

Let's imagine I have a blog and put something along these lines somewhere on every page: "This content is provided free of charge for humans to experience. It may also be automatically accessed for search indexing and archival purposes. For licensing information for other uses, contact the author."

If I then get hit by a rude AI scraper, what chances would I have to sue the hell out of them in EU courts for copyright violation (uhh, my articles cost 100k a pop for AI training, actually) and the de facto DDoS attack?

show 1 reply
eek2121 today at 1:35 AM

Disclosure: I've not run a website since my health issues began. However, Cloudflare has an AI firewall, and Cloudflare is super cheap (also: unsure if the AI firewall is on the free tier, but I would be surprised if it is not). Ignoring the recent drama about a couple of incidents they've had (because this would not matter for a personal blog), why not use that instead?

Just curious. Hoping to be able to work on a website again someday, if I ever get my health/stamina/etc. back.

show 2 replies
nkurz today at 2:41 AM

I was told by the admin of one forum site I use that the vast majority of the AI scraping traffic is Chinese at this point. Not hidden or proxied, but straight from China. Can anyone else confirm this?

Anyway, if it is true, and assuming a forum with minimal genuine Chinese traffic, might a simple approach that injects the porn links only for IPs accessing from China work?
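The gate itself would be tiny, something like this with the geoip-lite npm package (I'm assuming its lookup() shape from memory, so double-check it):

```ts
// Sketch of a country gate using the geoip-lite package; lookup() returns
// an object with a country code, or null if the IP isn't in its database.
import geoip from "geoip-lite";

export function shouldInjectDecoys(ip: string): boolean {
  const geo = geoip.lookup(ip);
  return geo?.country === "CN";
}
```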

show 3 replies
xgulfie today at 2:16 PM

Does anyone know if meta name=rating content=adult will also get them to buzz off?

temporallobe today at 3:33 AM

I do know from my experience with test automation that you can absolutely view a site as human eyes would, essentially ignoring all non-visible elements, and in fact Selenium running with Chrome driver does exactly this. Wouldn’t AI scrapers use similar methods?

show 1 reply
drclegg today at 10:47 AM

> So fuzzycanary also checks user agents

I wouldn't be so surprised if they often fake user agents, to be honest. Sure, it'll stop the "more honest" ones (but then, actually honest scrapers would respect robots.txt).

Cool idea though!

reconnecting yesterday at 9:19 PM

I wouldn't recommend showing different versions of the site to search robots, as they probably have mechanisms that track differences, which could potentially lead to a lower ranking or a ban.

show 1 reply
true_religion today at 4:31 AM

So, I work for a company that has RTA adult websites. AI bots absolutely do scrape our pages regardless of what raunchy material they will find. Maybe they discard it after ingest, but I can't tell. There are 1000s of AI bots on the web now, from companies big and small, so a solution like this will only divert a few scrapers.

megamix today at 7:31 AM

Without looking at the src, how does one detect these scrapers? I assume there's a trade-off somewhere, but do the scrapers not fake their headers in the request? Is this a cat-and-mouse game?

samename yesterday at 11:22 PM

This is a very creative hack to a common, growing problem. Well done!

Also, I like that you acknowledge it's a bad idea: that gives you more freedom to experiment and iterate.

shadowangel today at 11:58 AM

So if the bots use a Google user agent, they avoid the links?

cuku0078 today at 12:08 PM

Why is it so bad that AIs scrape your self-hosted blog?

show 1 reply
yjftsjthsd-h yesterday at 9:03 PM

How does this "look" to a screen reader?

show 1 reply
owl57 yesterday at 10:03 PM

> scrapers can ingest them and say "nope we won't scrape there again in the future"

Do all the AI scrapers actually do that?

show 1 reply
montroser yesterday at 11:18 PM

I don't know if I can get behind poisoning my own content in this way. It's clever, and might be a workable practical solution for some, but it's not a serious answer to the problem at hand (as acknowledged by OP).

show 1 reply
docheinestages today at 8:53 AM

Reminds me of this "Nathan for You" episode: https://www.youtube.com/watch?v=p9KeopXHcf8

montroser yesterday at 11:23 PM

Reminds me of poisoning bot responses with zip bombs of sorts: https://idiallo.com/blog/zipbomb-protection

show 1 reply
wcarss today at 3:09 PM

Singing copyrighted Billy Joel to make your footage unusable for reality television; thanks 30 Rock for an early view into this dystopian strategy

kislotnik today at 10:18 AM

Funny how the project aims to fight AI scraping, but seems to be using an AI-generated image of a bird?

show 1 reply
admiralrohan today at 4:35 AM

How do you know whether it is coming from AI scrapers? Do they leave any recognizable footprint?

I've been getting lots of noisy traffic since last month and it has increased my Vercel bill 4x. Not DDoS-like, the requests are much slower, but they're not from humans for sure.

taurath yesterday at 10:11 PM

Any other threads on the prevalence and nuisance of scrapers? I didn’t have any idea it was this bad.

show 2 replies
inetknght today at 12:48 AM

Porn? Distributed and/or managed by an NPM package?

What could go wrong?

wazoox yesterday at 9:06 PM

Isn't there a risk of getting your blog blocked in corporate environments, though? If it's a technical blog, that would be unfortunate.

show 1 reply
cport1 last Wednesday at 12:02 AM

That's a pretty hilarious idea, but in all seriousness you could use something like https://webdecoy.com/

show 1 reply
MisterTea yesterday at 9:59 PM

> It's you vs the MJs of programming, you're not going to win.

MJs? Michael Jacksons? Right now the whole world, including me, wants to know if that means they are bad?

show 2 replies
xena today at 1:57 AM

I love this. Please let me know how well it works for you. I may adjust recommendations based on your experiences.

geldedus today at 3:44 PM

"It's not porn, it's for science" :)))

valenceidra today at 12:06 AM

Hidden links to porn sites? Lightweights.

show 1 reply
JohnMakin yesterday at 9:33 PM

Cloudflare offers bot mitigation for free, and pretty generous WAF rules, which makes mitigations like this seem a little overblown to me.

show 4 replies
globalnode yesterday at 11:23 PM

One solution would be for the SEs to publish their scraper IPs and allow content providers to implement bot exclusion that way. Or even implement an API with crypto credentials that SEs can use to scrape. The solution is waiting for some leadership from the SEs, unless they want to be blocked as well. If SEs don't want to play, perhaps we can implement a reverse directory: like an ad blocker, but it lists only good/allowed bots instead. That's a free business idea right there.

edit: I noticed someone mentioned Google DOES publish its IPs, there ya go, problem solved.

show 1 reply
username223 yesterday at 8:50 PM

The more ways people mess with scrapers, the better -- let a thousand flowers bloom! You as an individual can't compete with VC-funded looters, but there aren't enough of them to defeat a thousand people resisting in different ways.

show 3 replies
efilife yesterday at 11:47 PM

> Alright so if you run a self-hosted blog, you've probably noticed AI companies scraping it for training data. ... There isn't much you can do about it without cloudflare

I'm sorry, what? I can't believe I am reading this on Hacker News. All you have to do is code your own basic captcha-like system. You can just create a page that sets a cookie using JS and check on the server whether it exists. 99.9999% of these scrapers can't execute JS and don't support cookies. You can go for a more sophisticated approach and analyze some more scraper tells (like rejecting short user agents). I do this and have NEVER had a bot get past it, and not a single user has ever complained. It's extremely simple; I should ship this and charge people if no one seems to be able to figure it out by themselves.
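As Express-style middleware it's something like this (names and the cookie value are made up; the approach is the point):

```ts
// Sketch of a JS+cookie challenge: clients that can't run JS or keep
// cookies never get past the challenge page. Names here are made up.
import express from "express";

const CHALLENGE_PAGE = `<!doctype html>
<script>
  document.cookie = "js_ok=1; path=/; max-age=86400";
  location.reload();
</script>
<noscript>This site requires JavaScript and cookies.</noscript>`;

const app = express();

app.use((req, res, next) => {
  const cookies = req.headers.cookie ?? "";
  if (!cookies.includes("js_ok=1")) {
    res.status(200).send(CHALLENGE_PAGE); // scrapers without JS/cookies stop here
    return;
  }
  next(); // real browsers come back with the cookie after the reload
});

app.get("/", (_req, res) => {
  res.send("<p>actual blog content</p>");
});

app.listen(3000);
```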

show 2 replies
pwlm today at 10:39 AM

What prevents AI scrapers from continuing to scrape sites that contain a <Canary> tag and simply not following the bad links?

show 1 reply
onetokeoverthe today at 10:12 AM

[dead]

View 1 more comment