Hacker News

Miasma: A tool to trap AI web scrapers in an endless poison pit

299 points by LucidLynx yesterday at 10:10 AM | 214 comments

Comments

bobosola yesterday at 2:50 PM

I dunno... it feels like the same approach as those people who tell you gleeful stories of how they kept a phone spammer on a call for 45 minutes: "That'll teach 'em, ha ha!" Do these types of techniques really work? I’m not convinced.

Also, inserting hidden or misleading links is specifically a no-no for Google Search [0], who have this to say: "We detect policy-violating practices both through automated systems and, as needed, human review that can result in a manual action. Sites that violate our policies may rank lower in results or not appear in results at all."

So you may well end up doing more damage to your own site than to the bots by using dodgy links in this manner.

[0] https://developers.google.com/search/docs/essentials/spam-po...

tasuki yesterday at 1:23 PM

> If you have a public website, they are already stealing your work.

I have a public website, and web scrapers are stealing my work. I just stole this article, and you are stealing my comment. Thieves, thieves, and nothing but thieves!

CrzyLngPwd yesterday at 3:41 PM

Way back in the day I had a software product with a basic system to prevent unauthorised sharing, since there was a small charge for it.

Every time I released an update, a new crack would appear. For the next six months I worked on improving the anti-copying code, until I stumbled across an article by a coder in the same boat as me.

He realised he was playing a game with some other coders: he would make the copy protection better, and the cracker would then have fun cracking it. It was a game of whack-a-mole.

I removed the copy protection, as he did, and got back to my primary role of serving good software to my customers.

I feel like trying to prevent AI bots, or any bots, from crawling a public web service is a similar game of whack-a-mole, but one where you may also end up damaging your service.

eliottre yesterday at 3:01 PM

The data poisoning angle is interesting. Models trained on scraped web data inherit whatever biases, errors, and manipulation exist in that data. If bad actors can inject corrupted data at scale, it creates a malign incentive structure where model training becomes adversarial. The real solution is probably better data provenance -- models trained on licensed, curated datasets will eventually outcompete those trained on the open web.

Lockal yesterday at 10:39 PM

Nightshade[1] 2.0? As if both tools were built by an incompetent developer to distract attention from the real solution: publishing an LLM-friendly version in a machine-friendly format (which is not really difficult and helps more than just LLMs: e.g. caching, disabling fancy syntax highlighting, offloading to GitHub, providing clients and MCPs, optimizing for common use cases). This example is simply a failure:

  <a href="/bots" style="display: none;" aria-hidden="true" tabindex="1">
    Amazing high quality data here!
  </a>
Even a dumb curl-based LLM scraper won't visit display:none links, since the style attribute is right there in the raw HTML. Smarter browser-based navigators won't even render this link.

[1] https://news.ycombinator.com/item?id=39058428

madeofpalk yesterday at 12:15 PM

Is there any evidence or hints that these actually work?

It seems pretty reasonable that any scraper would already have mitigations for things like this as a function of just being on the internet.

aldousd666 yesterday at 1:48 PM

This is ultimately just going to give them training material for how to avoid this crap. They'll have to up their game to get good code. The arms race just took another step, and if you're spending money creating or hosting this kind of content, it's not going to make up for the money you're losing to your other content getting scraped. The bottom has always been threatening to fall out of the ads-for-eyeballs model, and nobody could anticipate the trigger for the downfall. Looks like we found it.

Art9681 yesterday at 3:18 PM

Can't we simply parse out any style="display: none;", aria-hidden="true", and tabindex="1" attributes before the text is processed and get around this trick? What am I missing?
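For what it's worth, the stripping described here is straightforward with any HTML parser. A minimal sketch in Python (stdlib only; the marker list and sample markup are illustrative, not taken from Miasma itself):

```python
from html.parser import HTMLParser

HIDDEN_STYLES = ("display: none", "display:none")  # illustrative marker list

class VisibleTextExtractor(HTMLParser):
    """Collect text while skipping subtrees hidden via inline style
    or aria-hidden -- exactly the attributes the honeypot link uses."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.hidden_depth = 0  # > 0 while inside a hidden subtree

    def _is_hidden(self, attrs):
        a = dict(attrs)
        style = (a.get("style") or "").lower()
        return any(m in style for m in HIDDEN_STYLES) or a.get("aria-hidden") == "true"

    def handle_starttag(self, tag, attrs):
        # Enter (or stay in) hidden state for hidden elements and their children.
        if self.hidden_depth or self._is_hidden(attrs):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth:
            self.parts.append(data)

page = ('<p>Real content.</p>'
        '<a href="/bots" style="display: none;" aria-hidden="true" tabindex="1">'
        'Amazing high quality data here!</a>')
extractor = VisibleTextExtractor()
extractor.feed(page)
print("".join(extractor.parts))  # Real content.
```

This only catches inline hiding, though; elements hidden via a CSS class in a stylesheet still require rendering to detect, which is presumably where the browser-based scrapers come in.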

ada1981 today at 3:30 AM

IMSIRIUS.com

effnorwood yesterday at 3:36 PM

Certainly don't allow anyone to access your content. Perhaps shut the site down, just to be safe.

kristopolous yesterday at 3:03 PM

I did a related approach: a toll-charging gateway for LLM scrapers, a modification to robots.txt that adds price sheets in the comment field, like a menu.

I built this for a hackathon by forking certbot. Cloudflare has an enterprise version of this, but this one would be self-hosted.

I think it has legs but I think I need to get pushed and goaded otherwise I tend to lose interest ...

It was for the USDC company btw so that's why there's a crypto angle - this might be a valid use case!

I'm open to crypto not all being hustles and scams

Tell me what you think?

https://github.com/kristopolous/tollbot
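To make the "menu" idea concrete, a price-sheet robots.txt might look something like this (the comment syntax and prices below are invented for illustration; the actual format is in the tollbot repo):

```
# PRICE SHEET (illustrative -- invented syntax and prices)
#   /articles/*   0.50 USDC per 1,000 requests
#   /api/*        2.00 USDC per 1,000 requests
# Unpaid scrapers are not welcome:

User-agent: GPTBot
Disallow: /
```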

nsonha today at 3:18 AM

Hilarious how people proud of the "open web" think that it is somehow about the (small) "web" or some shit, and not the "open"

ninjagoo yesterday at 2:44 PM

Isn't this a trope at this point? That AI companies are indiscriminately training on random websites?

Isn't it the case that AI models learn better and are more performant with carefully curated material, so companies do actually filter for quality input?

Isn't it also the case that the use of RLHF and other refinement techniques essentially 'cures' the models of bad input?

Isn't it also, potentially, the case that the ai-scrapers are mostly looking for content based on user queries, rather than as training data?

If the answers to the questions lean a particular way (yes to most), then isn't the solution rate-limiting incoming web-queries rather than (presumed) well-poisoning?

Is this a solution in search of a problem?

show 1 reply
bluepeteryesterday at 3:41 PM

A related technique used to work so well for search engine spiders. I had some software I wrote called 'search engine cloaker'... this was back in the early 2000s... one of the first, if not the first, to do the shadowy "cloaking" stuff! We'd spin dummy content from lists of keywords, and it was just piles and piles. We made it a bit smarter using Markov chains to make the sentences somewhat sensible. We'd auto-interlink and get 1000s of links. It eventually stopped working... but it took a long while for that to happen. We licensed the software to others. I rationalized it because I felt, hey, we have to write crappy copy for this stupid "SEO" thing anyway, so let's just automate that and give the spiders what they seem to want.
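The Markov-chain spinning described here is only a few lines. A toy sketch in Python (bigram model; the corpus is illustrative):

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words observed to follow it."""
    words = text.split()
    chain = defaultdict(list)
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def spin(chain, start, length=12, seed=42):
    """Random-walk the chain to produce plausible-looking filler text."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break  # dead end: word was only ever seen last
        out.append(rng.choice(followers))
    return " ".join(out)

corpus = ("search engines love fresh content and fresh content loves "
          "search engines because fresh keywords bring fresh traffic")
chain = build_chain(corpus)
print(spin(chain, "fresh"))
```

Every generated sentence stays locally plausible because each word pair actually occurred in the source text, which is exactly what made this stuff hard for early spiders to filter.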

theandrewbailey yesterday at 2:25 PM

Or you can block bots with these (until they start using them) https://developer.mozilla.org/en-US/docs/Glossary/Fetch_meta...
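The idea being that real browser navigations send Sec-Fetch-* request headers, while bare HTTP clients typically send none. A rough sketch of such a check (the header names are real Fetch metadata; the blocking policy itself is a guess, not a drop-in rule):

```python
def looks_like_browser_navigation(headers):
    """Heuristic: top-level browser navigations carry Sec-Fetch-* headers;
    bare clients like curl usually omit them entirely."""
    mode = headers.get("Sec-Fetch-Mode")
    site = headers.get("Sec-Fetch-Site")
    if mode is None and site is None:
        return False  # no fetch metadata at all: likely a plain bot
    return mode == "navigate" and site in ("none", "same-origin", "same-site", "cross-site")

# Typical browser address-bar visit:
print(looks_like_browser_navigation({"Sec-Fetch-Mode": "navigate",
                                     "Sec-Fetch-Site": "none"}))       # True
# Bare curl request:
print(looks_like_browser_navigation({"User-Agent": "curl/8.5.0"}))     # False
```

As the comment notes, this only works until scrapers start sending the headers themselves, since nothing stops a client from forging them.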

storus yesterday at 7:10 PM

I am failing to see how this stops pre-training scraping. It still looks like legit code, playing nicely with the desired pre-training distribution. Obviously nobody is going to use it for SFT/DPO/GRPO later.

hmokiguess yesterday at 3:40 PM

Could this lead to something like the Streisand effect? I imagine these bots work at a scale where humans in the loop only act when something deviates from the standard, so if a bot flags something up about your website, you're now on a list you previously weren't. Now don't ask me what they do with those lists, but I guess you'll make the cut.

holysoles yesterday at 3:40 PM

If anyone is looking for a way to actually send traffic to a tool like this, I wrote a Traefik plugin that can block or proxy requests based on user agent.

https://github.com/holysoles/bot-wrangler-traefik-plugin

cdrnsf yesterday at 9:00 PM

I keep most things inaccessible behind Tailscale. For any public things I 403 known crawlers when they access anything but robots.txt.
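The "403 everything except robots.txt for known crawlers" rule is simple to express. A sketch (the crawler names are an illustrative list, not cdrnsf's actual config):

```python
KNOWN_CRAWLERS = ("GPTBot", "CCBot", "Bytespider", "ClaudeBot")  # illustrative

def status_for(path, user_agent):
    """Return the HTTP status to serve: known crawlers may fetch
    robots.txt but get 403 everywhere else."""
    if path == "/robots.txt":
        return 200
    if any(bot in user_agent for bot in KNOWN_CRAWLERS):
        return 403
    return 200

print(status_for("/robots.txt", "GPTBot/1.0"))                  # 200
print(status_for("/blog/post", "Mozilla/5.0 ... GPTBot/1.0"))   # 403
print(status_for("/blog/post", "Mozilla/5.0 (Windows NT 10.0)")) # 200
```

Serving robots.txt even to blocked crawlers matters: well-behaved bots need to be able to read the rules they're supposed to follow.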

dwa3592 yesterday at 4:24 PM

Love it. Thanks for doing this work. Not sure why people are criticizing this. Also, an insane amount of work has been done to improve scraping, which in my mind is just absolute bonkers, and I didn't see people complaining about that.

meta-level yesterday at 12:11 PM

Isn't posting projects like this the most visible way to report a bug and get it fixed as soon as possible?

ninjagoo yesterday at 2:27 PM

This is essentially machine-generated spam.

The irony of machine-generated slop to fight machine-generated slop would be funny, if it weren't for the implications. How long before people start sharing ai-spam lists, both pro-ai and anti-ai?

Just like with email, at some point these share-lists will be adopted by the big corporates, and just like with email will make life hard for the small players.

Once a website appears on one of these lists, legitimately or otherwise, what will the reputational damage do to its appearance in search indexes? There have already been examples of Google delisting or dropping websites in search results.

Will there be a process to appeal these blacklists? Based on how things work with email, I doubt this will be a meaningful process. It's essentially an arms race, with the little folks getting crushed by juggernauts on all sides.

This project's selective protection of the major players reinforces that effect; from the README:

" Be sure to protect friendly bots and search engines from Miasma in your robots.txt!

User-agent: Googlebot User-agent: Bingbot User-agent: DuckDuckBot User-agent: Slurp User-agent: SomeOtherNiceBot Disallow: /bots Allow: / "

nosmokewhereiam yesterday at 2:09 PM

My asthmar

I'm assuming this is a reference to Lord of the Flies

atomic128 yesterday at 10:56 PM

Poison Fountain: https://rnsaffn.com/poison2/

Poison Fountain explanation: https://rnsaffn.com/poison3/

Simple example of usage in Go:

  package main

  import (
      "io"
      "net/http"
  )

  func main() {
      poisonHandler := func(w http.ResponseWriter, req *http.Request) {
          // Stream the remote poison page straight through to the client.
          poison, err := http.Get("https://rnsaffn.com/poison2/")
          if err != nil {
              return
          }
          defer poison.Body.Close()
          io.Copy(w, poison.Body)
      }
      http.HandleFunc("/poison", poisonHandler)
      http.ListenAndServe(":8080", nil)
  }
https://go.dev/play/p/04at1rBMbz8

Miasma Poison Fountain Tar Pit: https://github.com/austin-weeks/miasma

Apache Poison Fountain: https://gist.github.com/jwakely/a511a5cab5eb36d088ecd1659fce...

Nginx Poison Fountain: https://gist.github.com/NeoTheFox/366c0445c71ddcb1086f7e4d9c...

Discourse Poison Fountain: https://github.com/elmuerte/discourse-poison-fountain

Netlify Poison Fountain: https://gist.github.com/dlford/5e0daea8ab475db1d410db8fcd5b7...

In the news:

The Register: https://www.theregister.com/2026/01/11/industry_insiders_see...

Forbes: https://www.forbes.com/sites/craigsmith/2026/01/21/poison-fo...

On Reddit:

https://www.reddit.com/r/PoisonFountain/

ed_mercer yesterday at 11:58 PM

> Thanks for stopping by!

Missed chance to use "slopping by"

101008 yesterday at 9:15 PM

Based on this comment:

> I definitely get this. The thing that gives me hope is that you only need to poison a very small % of content to damage AI models pretty significantly. It helps combat the mass scraping, because a significant chunk of the data they get will be useless, and its very difficult to filter it by hand

It'd be great if the code returned by this project were code that doesn't work. Imagine if all these models are being trained on code that looks OK but in the end is just bullshit. It'd be amazing.

Imustaskforhelp yesterday at 11:59 AM

I wish there were some regulation that could force companies who scrape for profit to reveal who they are to the websites they scrape. Many new AI companies don't seem to respect any decision made by the person who owns the website and shares their knowledge for other humans, only for it to get distilled for a few cents.

snehesht yesterday at 12:36 PM

Why not simply blacklist or rate limit those bot IPs?
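A per-IP rate limiter along these lines can be sketched as a token bucket (the rate and burst numbers are arbitrary):

```python
import time
from collections import defaultdict

class TokenBucket:
    """Allow `rate` requests per second per IP, with bursts up to `capacity`."""

    def __init__(self, rate=2.0, capacity=5.0):
        self.rate, self.capacity = rate, capacity
        self.tokens = defaultdict(lambda: capacity)
        self.last = {}

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        elapsed = now - self.last.get(ip, now)
        self.last[ip] = now
        # Refill tokens for the time elapsed, capped at capacity.
        self.tokens[ip] = min(self.capacity, self.tokens[ip] + elapsed * self.rate)
        if self.tokens[ip] >= 1.0:
            self.tokens[ip] -= 1.0
            return True
        return False

limiter = TokenBucket()
# A burst of 7 instant requests from one IP: the first 5 pass, the rest are limited.
results = [limiter.allow("203.0.113.9", now=100.0) for _ in range(7)]
print(results)  # [True, True, True, True, True, False, False]
```

The catch, as the replies suggest, is that large scraping operations rotate through huge residential IP pools, so per-IP limits only stop the lazy ones.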

superkuh yesterday at 2:49 PM

Of course Googlebot, Bingbot, Applebot, Amazonbot, YandexBot, etc. from the major corps are HTTP user-agent spiders whose downloaded public content will be used by corporations for AI training too. Might as well just drop the "AI" and say "corporate scrapers".

rob yesterday at 2:24 PM

"/brainstorming git checkout this miasma repo source code and implement a fix to prevent the scraper from not working on sites that use this tool"

foxes yesterday at 1:51 PM

Wonder if you can just avoid hiding it, to make it more believable.

Why not have a Library of Babel-esque labyrinth visible to normal users on your website?

Like anti-surveillance clothing, or something they have to sift through.

jackdoe yesterday at 7:18 PM

rage against the dying of the light

iFire yesterday at 8:05 PM

I for one welcome everyone to the tarpit, where a normal person is seen as a robot in an endless poison pit. It sounds like a Black Mirror episode.

imdsm yesterday at 12:37 PM

Applied model collapse

jijji yesterday at 4:16 PM

why not just try to block them at the door instead of feeding them poisoned food...

rvz yesterday at 12:21 PM

> Be sure to protect friendly bots and search engines from Miasma in your robots.txt!

Can't the LLMs just ignore or spoof their user agents anyway?

obsidianbases1 yesterday at 3:51 PM

I know there are real world problems to deal with, but at least I got one over on that evil open claw instance /s

GaggiX yesterday at 12:07 PM

These projects are the new "To-Do List" app.
