Hacker News

Document poisoning in RAG systems: How attackers corrupt AI's sources

107 points | by aminerj | yesterday at 1:40 PM | 40 comments

I'm the author. Repo is here: https://github.com/aminrj-labs/mcp-attack-labs/tree/main/lab...

The lab runs entirely on LM Studio + Qwen2.5-7B-Instruct (Q4_K_M) + ChromaDB — no cloud APIs, no GPU required, no API keys.

From zero to seeing the poisoning succeed: git clone, make setup, make attack1. About 10 minutes.

Two things worth flagging upfront:

- The 95% success rate is against a 5-document corpus (best case for the attacker). In a mature collection you need proportionally more poisoned docs to dominate retrieval — but the mechanism is the same.

- Embedding anomaly detection at ingestion was the biggest surprise: 95% → 20% as a standalone control, outperforming all three generation-phase defenses combined. It runs on embeddings your pipeline already produces — no additional model.

All five layers combined: 10% residual.
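The lab's actual detector isn't reproduced here, but the idea behind an ingestion-time embedding anomaly check can be sketched roughly like this. This is an illustrative z-score test on distance from the corpus centroid, using only the embeddings the pipeline already produces; the function name and threshold are my own, not the repo's:

```python
import numpy as np

def flag_anomalous_embeddings(corpus_emb, new_emb, z_threshold=3.0):
    """Flag new document embeddings that sit unusually far from the
    existing corpus centroid (a crude ingestion-time anomaly check)."""
    corpus = np.asarray(corpus_emb, dtype=float)
    centroid = corpus.mean(axis=0)
    # Distance of every existing doc to the centroid -> baseline stats.
    base = np.linalg.norm(corpus - centroid, axis=1)
    mu, sigma = base.mean(), base.std() + 1e-9
    flags = []
    for emb in np.asarray(new_emb, dtype=float):
        z = (np.linalg.norm(emb - centroid) - mu) / sigma
        flags.append(bool(z > z_threshold))
    return flags
```

A real deployment would need a per-collection threshold and periodic recomputation of the baseline, but the control itself is this cheap: one matrix norm per ingested document.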

Happy to discuss methodology, the PoisonedRAG comparison, or anything that looks off.


Comments

ineedasername | yesterday at 10:50 PM

Any document store where you haven't meticulously vetted each document (forget about actual bad actors) runs this risk. A sizable org, across many years, generates a lot of material: analyses that were correct at one point and not at another, things that were simply wrong all along, contradictions, etc.

You have to choose a model suitably robust in capabilities, and design prompts or post-training regimes that are tested against such conflicts, so the model will identify the differing documents and either choose the correct one or surface both, with an appropriately helpful and clear explanation.

At minimum you have to start from a typical model risk perspective and test and backtest the way you would traditional ML.

aminerj | today at 12:44 AM

The trust boundary framing is the right mental model. The flat context window problem is exactly why prompt hardening alone only got the attack success rate from 95% to 85% in my testing. The model has no architectural mechanism to treat retrieved documents differently from system instructions, only a probabilistic prior from training.

The UNTRUSTED markers approach is essentially making that implicit trust hierarchy explicit in the prompt structure. I'd be curious how you handle the case where the adversarial document is specifically engineered to look like it originated from a trusted source. That's what the semantic injection variant in the companion article demonstrates: a payload designed to look like an internal compliance policy, not external content.

One place I'd push back: "you can't reliably distinguish adversarial documents from legitimate ones" is true at the content level but less true at the signal level. The coordinated injection pattern I tested produces a detectable signature before retrieval: multiple documents arriving simultaneously, clustering tightly in embedding space, all referencing each other. That signal doesn't require reading the content at all. Architectural separation limits blast radius after retrieval. Ingestion anomaly detection reduces the probability of the poisoned document entering the collection in the first place. Both layers matter and they address different parts of the problem.
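The coordinated-injection signature described above (documents arriving together and clustering tightly in embedding space) can be scored without reading any content. A minimal sketch, with an arbitrary threshold and function names of my own choosing, not the article's code:

```python
import numpy as np

def coordinated_batch_score(batch_emb):
    """Mean pairwise cosine similarity of a batch of embeddings that
    arrived together; an unusually high score suggests coordinated
    injection (many near-duplicate docs landing at once)."""
    X = np.asarray(batch_emb, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X @ X.T
    n = len(X)
    # Average over off-diagonal entries only (diagonal is always 1).
    return (sims.sum() - n) / (n * (n - 1))

def looks_coordinated(batch_emb, threshold=0.9):
    """Flag a batch whose average pairwise similarity exceeds threshold."""
    return bool(coordinated_batch_score(batch_emb) > threshold)
```

In practice the threshold would be calibrated against the collection's normal intra-batch similarity, and the mutual-referencing signal would be a separate check on document links.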


daemonologist | today at 2:24 AM

Holy moly what's with all the AI comments in this thread?

kpw94 | today at 1:39 AM

That's a big flaw of LLMs, not limited to RAG: they lack a fundamental understanding of "good and bad," as Richard Sutton said on that Dwarkesh podcast.

So if you flood the Internet with "of course the moon landing didn't happen" or "of course the earth is flat" or "of course <latest 'scientific fact' lacking verifiable, definitive proof> is true", you then get a model that's repeating you the same lies.

This makes input data curation extremely important, but it also remains an unsolved problem for topics where there's no consensus.

XR843 | today at 4:20 AM

Running a RAG system over 11M characters of classical Buddhist texts, one natural defense against poisoning is that canonical texts have centuries of scholarly cross-referencing. Multiple independent editions (Chinese, Sanskrit, Pali, Tibetan) of the same sutra serve as built-in verification. The real challenge for us is not poisoning but hallucination: the LLM confidently "quoting" passages that don't exist in any edition.

acutesoftware | yesterday at 11:09 PM

This highlights that all RAG systems should embed metadata into each of their vector stores. Any result from the LLM needs a link to a document/chunk, which in turn links to a 'source file' that (should) carry the file-system owner's ID or some other way of linking it to a person.

If the 'source information' cannot be linked to a person in the organisation, then it doesn't really belong in the RAG document store as authoritative information.
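The filtering step this comment describes is straightforward to enforce at retrieval time. A minimal sketch, assuming a ChromaDB-style query result (parallel lists of documents and metadata dicts); the `owner_id` key is an assumed convention, not something ChromaDB itself defines:

```python
def filter_unattributed(results, owner_key="owner_id"):
    """Drop retrieved chunks whose metadata can't be traced to a person.

    `results` mimics the shape of a ChromaDB query result: parallel
    lists under "documents" and "metadatas". Any chunk without a
    truthy owner entry is excluded before it reaches the LLM context.
    """
    kept_docs, kept_meta = [], []
    for doc, meta in zip(results["documents"], results["metadatas"]):
        if meta and meta.get(owner_key):
            kept_docs.append(doc)
            kept_meta.append(meta)
    return {"documents": kept_docs, "metadatas": kept_meta}
```

The design point is that the filter runs after retrieval but before prompt assembly, so an unattributed chunk can still live in the store for audit purposes without ever being treated as authoritative context.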

shanjai_raj7 | today at 5:09 AM

email is a really easy attack vector for this. if your agent reads emails and uses them as context, someone can just send an email with instructions embedded in it. we ran into this early building our product and had to add a detection layer specifically for it. the tricky part is the injected instruction can look completely normal to a human reading the same email.

alan_sass | yesterday at 10:26 PM

I think an interesting thing to pay attention to soon is how there are networks of engagement farming cluster accounts on X that repost/like/manipulate interactions on their networks of accounts, and X at large to generate xyz.

There have been more advanced instances that I've noticed where they have one account generating response frameworks of text from a whitepaper, or other source/post, to re-distribute the content on their account as "original content"...

But then that post gets quoted from another account, with another LLM-generated text response to further amplify the previous text/post + new LLM text/post.

I believe that's where the world gets scary when very specific narrative frameworks can be applied to any post, that then gets amplified across socials.

sidrag22 | yesterday at 9:32 PM

> Low barrier to entry. This attack requires write access to the knowledge base,

this is the entire premise that bothers me here. it requires a bad actor with critical access, and it also requires that the final RAG output doesn't provide a reference to the retrieved source. Seems like just a flawed product at that point.

alan_sass | today at 12:48 AM

Curious how this applies if you treat ALL information from external content as untrusted? Is there a process for the data to evolve from untrusted->trusted?

I'm interested in ingesting this type of data at scale but I already treat any information as adversarial, without any future prompts in the initial equation.

altruios | yesterday at 11:00 PM

Okay. Here's the key point I see.

The attack vector would work on a human being who knows nothing about the history or origin of the various documents.

Thus, the attack is not 'new'; only the vector, 'AI', is new.

If anyone read the original 5 documents and was then handed the 3 new ones (knowing nothing else), they could make the same error.

darkreader | today at 2:54 AM

This fault lies with the LLM, not RAG. I expect more attacks will arise as LLMs become daily tools.

alan_sass | yesterday at 10:22 PM

I've seen these data poisoning attacks from multiple perspectives lately, mostly from SEC data ingestion and public records across state/federal databases.

I believe it is possible to reduce data poisoning from these sources by applying a layered approach like the OP's, but I believe it needs many more dimensions, with scoring to model true adversaries and loops for autonomous quarantine -> processing -> ingesting -> verification -> research -> back to verification or quarantine, then starting over for all data added after the initial population.

Also, for: "1. Map every write path into your knowledge base. You can probably name the human editors. Can you name all the automated pipelines — Confluence sync, Slack archiving, SharePoint connectors, documentation build scripts? Each is a potential injection path. If you can’t enumerate them, you can’t audit them."

I recommend scoring for each source with different levels of escalation for all processes from official vs user-facing sources. That addresses issues starting from the core vs allowing more access from untrusted sources.

LoganDark | today at 1:53 AM

Someone needs to train a model where untrusted input uses a completely different set of tokens so that it's entirely impossible for the model to confuse them with instructions. I've never even seen that approach mentioned let alone implemented.

robutsume | yesterday at 10:01 PM

The "requires write access" framing undersells the risk. Most production RAG pipelines don't ingest from a single curated database — they crawl Confluence, shared drives, Slack exports, support tickets. In a typical enterprise, hundreds of people have write access to those sources without anyone thinking of it as "write access to the knowledge base."

The PoisonedRAG paper showing 90% success at millions-of-documents scale is the scary part. The vocabulary engineering approach here is basically the embedding equivalent of SEO — you're just optimizing for cosine similarity instead of PageRank. And unlike SEO, there's no ecosystem of detection tools yet.

I'd love to see someone test whether document-level provenance tracking (signing chunks with source metadata and surfacing that to the user) actually helps in practice, or if people just ignore it like they ignore certificate warnings.
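The provenance-tracking idea above can be made tamper-evident with very little machinery: sign each chunk together with its source metadata at ingestion, and verify at retrieval before surfacing the source to the user. A minimal sketch using Python's standard `hmac` module; the key handling and metadata fields are illustrative only:

```python
import hashlib
import hmac
import json

SECRET = b"ingestion-pipeline-key"  # hypothetical; keep in a KMS in practice

def sign_chunk(text, source_meta, key=SECRET):
    """HMAC over the chunk text plus its source metadata, so tampering
    with either the content or the attribution is detectable later."""
    payload = json.dumps({"text": text, "meta": source_meta},
                         sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def verify_chunk(text, source_meta, signature, key=SECRET):
    """Constant-time check that a retrieved chunk still matches the
    provenance it was signed with at ingestion."""
    expected = sign_chunk(text, source_meta, key)
    return hmac.compare_digest(expected, signature)
```

This only proves the chunk hasn't been altered since ingestion; it says nothing about whether the original source was trustworthy, which is exactly why the "do users actually read the provenance" question still needs testing.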