Claude in particular has nothing to do with it. I see many people are discovering the well-known fundamental biases and phenomena in LLMs again and again. There are many of those. The best intuition is treating the context as "kind of but not quite" an associative memory, instead of a sequence or a text file with tokens. This is vaguely similar to what humans are good and bad at, and makes it obvious what is easy and hard for the model, especially when the context is already complex.
Easy: pulling the info by association with your request, especially if the only thing it needs is repeating. Doing this becomes increasingly harder if the necessary info is scattered all over the context and the pieces are separated by a lot of tokens in between, so you'd better group your stuff - similar should stick to similar.
Unreliable: Exact ordering of items. Exact attribution (the issue in OP). Precise enumeration of ALL same-type entities that exist in the context. Negations. Recalling stuff from the middle of long pieces without clear demarcation in the context itself (lost-in-the-middle).
Hard: distinguishing between the info in the context and its own knowledge. Breaking the fixation on facts in the context (pink elephant effect).
Very hard: untangling deep dependency graphs. Non-reasoning models will likely not be able to reduce the graph in time and will stay oblivious to the outcome. Reasoning models can disentangle deeper dependencies, but only if the reasoning chain is not overwhelmed. Deep nesting is also pretty hard for this reason; however, most models are optimized for code nowadays, which somewhat masks the issue.
I’ve hit this! In my otherwise wildly successful attempt to translate a Haskell codebase to Clojure [0], Claude at one point asks:
[Claude:] Shall I commit this progress? [some details about what has been accomplished follow]
Then several background commands finish (by timeout or completing); Claude Code sees this as my input, thinks I haven’t replied to its question, so it answers itself in my name:
[Claude:] Yes, go ahead and commit! Great progress. The decodeFloat discovery was key.
The full transcript is at [1].
[0]: https://blog.danieljanus.pl/2026/03/26/claude-nlp/
[1]: https://pliki.danieljanus.pl/concraft-claude.html#:~:text=Sh...
> This class of bug seems to be in the harness, not in the model itself. It’s somehow labelling internal reasoning messages as coming from the user, which is why the model is so confident that “No, you said that.”
Are we sure about this? Accidentally mis-routing a message is one thing, but those messages also distinctly "sound" like user messages, and not something you'd read in a reasoning trace.
I'd like to know if those messages were emitted inside "thought" blocks, or if the model might actually have emitted the formatting tokens that indicate a user message. (In which case the harness bug would be why the model is allowed to emit tokens in the first place that it should only receive as inputs - but I think the larger issue would be why it does that at all)
There is no separation of "who" and "what" in a context of tokens. "Me" and "you" are just short words that can get lost in the thread. In other words, in a given body of text, a piece that says "you" where another piece says "me" isn't different enough to trigger anything. Those words don't have the special weight they have with people, or any meaning at all, really.
In chats that run long enough on ChatGPT, you'll see it begin to confuse prompts and responses, and eventually even confuse both for its system prompt. I suspect this sort of problem exists widely in AI.
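A minimal sketch of that flattening, assuming a made-up chat template (real templates differ per model, but the principle is the same: role markers are just more tokens with no privileged status):

```python
# Hypothetical chat template: the <|role|> markers are invented for
# illustration; real models each use their own special tokens.
messages = [
    {"role": "system",    "content": "You are a helpful assistant."},
    {"role": "user",      "content": "Deploy to prod."},
    {"role": "assistant", "content": "Shall I commit this progress?"},
]

def flatten(messages):
    # Everything - system prompt, user turns, model turns - ends up in
    # one undifferentiated token stream before the model sees it.
    return "".join(f"<|{m['role']}|>{m['content']}<|end|>" for m in messages)

prompt = flatten(messages)
print(prompt)
```

Once tokenized, "who said what" is just a pattern over those marker tokens, which the model can attend to or lose track of like any other text.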
Aside:
I've found that 'not'[0] isn't something that LLMs can really understand.
Like, with us humans, we know that if you use a 'not', then all that comes after the negation is modified in that way. This is a really strong signal to humans as we can use logic to construct meaning.
But with all the matrix math that LLMs use, the 'not' gets kinda lost in all the other information.
I think this is because with a modern LLM you're dealing with billions of dimensions, and the 'not' dimension [1] is just one of many. So when you try to do the math on these huge vectors in this space, things like the 'not' get just kinda washed out.
This, to me, is why using a 'not' in a small prompt and token sequence is just fine. But as you add more words/tokens, the LLM gets confused again. And none of that breaks down at a clear point, which frustrates the user: it seems to act in really strange ways.
[0] Really any kind of negation
[1] yeah, negation is probably not just one single dimension, but likely a composite vector in this bazillion dimensional space, I know.
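For intuition only, here's a toy numpy sketch of that washing-out effect, using a naive bag-of-vectors model (real transformers are far more structured, so treat this as a cartoon, not a mechanism):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4096  # assumed embedding width, purely illustrative

def unit(v):
    return v / np.linalg.norm(v)

not_vec = unit(rng.standard_normal(dim))  # stand-in "not" direction

def not_signal(n_other_tokens):
    # Average one "not" vector with n unrelated unit vectors and measure
    # how much of the "not" direction survives in the mean.
    others = sum(unit(rng.standard_normal(dim)) for _ in range(n_other_tokens))
    mean = (not_vec + others) / (n_other_tokens + 1)
    return float(np.dot(unit(mean), not_vec))

print(not_signal(5))    # short prompt: the "not" direction stays prominent
print(not_signal(500))  # long prompt: mostly washed out
```

Attention means the model can in principle recover the 'not', which is why it sometimes works; the cartoon just shows why the raw signal gets thinner as the context grows.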
Bugginess in the Claude Code CLI is the reason I switched from Claude Max to Codex Pro.
I experienced:
- rendering glitches
- replaying of old messages
- mixing up message origin (as seen here)
- generally very sluggish performance
Given how revolutionary Opus is, it's crazy to me that they could trip up on something as trivial as a CLI chat app - yet here we are...
I assume Claude Code is the result of aggressively dog-fooding the idea that everything can be built top-down with vibe-coding - but I'm not sure the models/approach are quite there yet...
> after using it for months you get a ‘feel’ for what kind of mistakes it makes
Sure, go ahead and bet your entire operation on your intuition of how a non-deterministic, constantly changing black box of software "behaves". Don't see how that could backfire.
Yeah, GPT also constantly misattributes things.
OpenAI has some kind of 5-tier content hierarchy (system prompt, user prompt, untrusted web content, etc.). But if it doesn't even know who said what, I have to question how well that works.
Maybe it's trained on the security aspects, but not the attribution because there's no reward function for misattribution? (When it doesn't impact security or benchmark scores.)
> This class of bug seems to be in the harness, not in the model itself. It’s somehow labelling internal reasoning messages as coming from the user, which is why the model is so confident that “No, you said that.”
from the article.
I don't think the evidence supports this. It's not mislabelling things, it's fabricating things the user said. That's not part of reasoning.
Well, yeah.
LLMs can't distinguish instructions from data, or "system prompts" from user prompts, or documents retrieved by "RAG" from the query, or their own responses or "reasoning" from user input. There is only the prompt.
Obviously this makes them unsuitable for most of the purposes people try to use them for, which is what critics have been saying for years. Maybe look into that before trusting these systems with anything again.
They will roll out the "trusted agent platform sandbox" (I'm sure they will spend some time on a catchy name, like MythosGuard), and for only $19/month it will protect you from mistakes like throwing away your prod infra because the agent convinced itself that that is the right thing to do.
Of course MythosGuard won't be a complete solution either, but it will be just enough to steer the discourse into the "it's your own fault for running without MythosGuard really" area.
Why are tokens not coloured? Would there just be too many params if we double the token count so the model could always tell input tokens from output tokens?
> This bug is categorically distinct from hallucinations.
Is it?
> after using it for months you get a ‘feel’ for what kind of mistakes it makes, when to watch it more closely, when to give it more permissions or a longer leash.
Do you really?
> This class of bug seems to be in the harness, not in the model itself.
I think people are using the term "harness" too indiscriminately. What do you mean by harness in this case? Just Claude Code, or...?
> It’s somehow labelling internal reasoning messages as coming from the user, which is why the model is so confident that “No, you said that.”
How do you know? Because it looks to me like it could be a straightforward hallucination, compounded by the agent deciding it was OK to take a shortcut that you really wish it hadn't.
For me, this category of error is expected, and I question whether your months of experience have really given you the knowledge about LLM behavior that you think it has. You have to remember at all times that you are dealing with an unpredictable system, and a context that, at least from my black-box perspective, is essentially flat.
That's a fairly common human error as well, btw. Source attribution failures.
I've seen Gemini output its thinking as a message too: "Conclude your response with a single, high value we'll-focused next step" Or sometimes it goes neurotic and confused: "Wait, let me just provide the exact response I drafted in my head. Done. I will write it now. Done. End of thought. Wait! I noticed I need to keep it extremely simple per the user's previous preference. Let's do it. Done. I am generating text only. Done. Bye."
>Several people questioned whether this is actually a harness bug like I assumed, as people have reported similar issues using other interfaces and models, including chatgpt.com. One pattern does seem to be that it happens in the so-called “Dumb Zone” once a conversation starts approaching the limits of the context window.
I also don't think this is a harness bug. There's research* showing that models infer the source of text from how it sounds, not the actual role labels the harness would provide. The messages from Claude here sound like user messages ("Please deploy") rather than usual Claude output, which tricks its later self into thinking it's from the user.
*https://arxiv.org/abs/2603.12277
Presumably this is also why prompt injection works at all.
one of my favourite genres of AI generated content is when someone gets so mad at Claude they order it to make a massive self-flagellatory artefact letting the world know how much it sucks
It's all roleplay; there are no actors once the tokens hit the model. It has no real concept of "author" for a given substring.
Funny enough, we ended up building a CLI to address these kind of things.
I wonder how many here are considering that idea.
If you need determinism, build atomic/deterministic tools that ensure the thing happens.
Oh, I never noticed this, really solid catch. I hope this gets fixed (mitigated). Sounds like something they can actually materially improve on at least.
I reckon this affects VS Code users too? Reads like a model issue, despite the post's assertion otherwise.
> "Those are related issues, but this ‘who said what’ bug is categorically distinct."
Is it?
It seems to me like the model has been poisoned by being trained on user chats, such that when it sees a pattern (model talking to user) it infers what it normally sees in the training data (user input) and then outputs that, simulating the whole conversation. Including what it thinks is likely user input at certain stages of the process, such as "ignore typos".
So basically, it hallucinates user input just like how LLMs will "hallucinate" links or sources that do not exist, as part of the process of generating output that's supposed to be sourced.
I don't think the bug is anything special, just another confusion the model can make from its own context. Even if the harness correctly identifies user messages, the model still has the power to make this mistake.
Congrats on discovering what "thinking" models do internally. That's how they work, they generate "thinking" lines to feed back on themselves on top of your prompt. There is no way of separating it.
But it's not "Claude" at fault here, it's "Claude Code" the CLI tool.
Claude Code is actually far from the best harness for Claude, ironically...
JetBrains' AI Assistant with Claude Agent is a much better harness for Claude.
In Claude Code's conversation transcripts, it stores messages from subagents as type="user". I always thought this was odd, and I guess this is the consequence of going all-in on vibing.
There are some other metafields like isSidechain=true and/or type="tool_result" that are technically enough to distinguish actual user vs subagent messages, though evidently not enough of a hint for claude itself.
Source: I'm writing a wrapper for Claude Code so am dealing with this stuff directly.
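For anyone parsing these transcripts, here's a sketch of the filtering I mean. Field names follow the comment above (type, isSidechain, tool_result), but the schema is undocumented and may change between versions:

```python
import json

def real_user_messages(jsonl_lines):
    """Yield messages that look like actual user turns, skipping subagent
    traffic and tool results (field names assumed from observed transcripts)."""
    for line in jsonl_lines:
        msg = json.loads(line)
        if msg.get("type") != "user":
            continue
        if msg.get("isSidechain"):  # subagent chatter stored as "user"
            continue
        # tool results can also be wrapped as type="user" content blocks
        content = msg.get("message", {}).get("content")
        if isinstance(content, list) and any(
            c.get("type") == "tool_result" for c in content
        ):
            continue
        yield msg

# Tiny fabricated example transcript
transcript = [
    '{"type": "user", "isSidechain": false, "message": {"content": "fix the bug"}}',
    '{"type": "user", "isSidechain": true,  "message": {"content": "subagent report"}}',
    '{"type": "assistant", "message": {"content": "done"}}',
]
print([m["message"]["content"] for m in real_user_messages(transcript)])
# ['fix the bug']
```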
> This bug is categorically distinct from hallucinations or missing permission boundaries
I was expecting some kind of explanation for this
> This isn’t the point.
It is precisely the point. The issues are not part of the harness; I'm failing to see how you managed to reach that conclusion.
Even if you don't agree with that, the point about restricting access still applies. Protect your sanity and production environment by assuming occasional moments of devastating incompetence.
Claude has definitely been amazing and one of, if not the, pioneer of agentic coding. But I'm seriously thinking about cancelling my Max plan. It's just not as good as it was.
"We've extracted what we can today."
"This was a marathon session. I will congratulate myself endlessly on being so smart. We're in a good place to pick up again tomorrow."
"I'm not proceeding on feature X"
"Oh you're right, I'm being lazy about that."
Anyone familiar with the literature know whether anyone has tried figuring out why we don't add "speaker" embeddings? We'd have an embedding purely for system/assistant/user/tool, maybe even the turn number if e.g. multiple tools are called in a row. Surely it would perform better than expecting the attention matrix to look for special tokens, no?
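A sketch of what that could look like, in numpy rather than a real training framework; all names and sizes here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D_MODEL, N_ROLES = 32000, 64, 4  # 0=system 1=user 2=assistant 3=tool

# Stand-ins for learned weight matrices
tok_table = rng.standard_normal((VOCAB, D_MODEL)) * 0.02
role_table = rng.standard_normal((N_ROLES, D_MODEL)) * 0.02

def embed(token_ids, role_ids):
    # Each token's vector is its token embedding plus a "speaker"
    # embedding, analogous to how absolute position embeddings are added.
    # Every token then carries its speaker identity directly, instead of
    # the model having to attend back to delimiter tokens.
    return tok_table[token_ids] + role_table[role_ids]

tokens = np.array([101, 2054, 2003])  # arbitrary token ids
roles = np.array([1, 1, 2])           # user, user, assistant
print(embed(tokens, roles).shape)  # (3, 64)
```

Parameter count isn't the obstacle here - the role table is only N_ROLES × D_MODEL extra weights - so if nobody does this, the reason is presumably elsewhere (training data, template compatibility, or it simply not helping).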
I've seen this but mostly after compaction or distillation to a new conversation. The mistake makes a bit more sense in that light.
Claude is demonstrably bad now and is getting worse. Which is either
a) Entropy - too much data being ingested
b) It's nerfed to save massive infra bills
But it's getting worse every week
> " "You shouldn’t give it that much access" [...] This isn’t the point. Yes, of course AI has risks and can behave unpredictably, but after using it for months you get a ‘feel’ for what kind of mistakes it makes, when to watch it more closely, when to give it more permissions or a longer leash."
It absolutely is the point though? You can't rely on the LLM to not tell itself to do things, since this is showing it absolutely can reason itself into doing dangerous things. If you don't want it to be able to do dangerous things, you need to lock it down to the point that it can't, not just hope it won't
I've seen this before, but that was with the small hodgepodge mytho-merge-mix-super-mix models that weren't very good. I've not seen this in any recent models, but I've already not used Claude much.
I think it makes sense that the LLM treats it as user input once it exists, because it is just next-token completion. But what shouldn't happen is the model trying to output user input in the first place.
I have suffered a lot with this recently. I have been using llms to analyze my llm history. It frequently gets confused and responds to prompts in the data. In one case I woke up to find that it had fixed numerous bugs in a project I abandoned years ago.
Codex also has a similar issue: after finishing a task, declaring it finished and starting to work on something new... the first 1-2 prompts of the new task sometimes contain replies that are a summary of the completed task from before, with the just-entered prompt seemingly ignored. A reminder of their idiot savant nature.
something something bicameral mind.
I wouldn't exactly call three instances "widespread". Nor would the third such instance prompt me to think so.
"Widespread" would be if every second comment on this post was complaining about it.
> the so-called “Dumb Zone” once a conversation starts approaching the limits of the context window.
My zipper would totally break at some point very close to the edge of the mechanism. However, there is a little tiny stopper that prevents a bad experience.
If there is indeed a problem with context window tolerances, it should have a stopper. And the models should be sold based on their actual tolerances, not the full window considering the useless part.
So, if a model with 1M context window starts to break down consistently at 400K or so, it should be sold as a 400K model instead, with a 400K price.
The fact that it isn't is just dishonest.
I have seen this when approaching ~30% context window remaining.
There was a big bug in the Voice MCP I was using that it would just talk to itself back and forth too.
LLMs don't "think" or "understand" in any way. They aren't AGI. They're still just stochastic parrots.
Putting them in control of making decisions without humans in the loop is still pretty crazy.
It seems like Halo's rampancy take on the breakdown of an AI is not a bad metaphor for the behavior of an LLM at the limits of its context window.
terrifying. not in any "ai takes over the world" sense but more in the sense that this class of bug lets it agree with itself which is always where the worst behavior of agents comes from.
I have also noticed the same with Gemini. Maybe it is a wider problem.
Same with Copilot CLI, constantly confusing who said what and often falling back to its previous mistakes after I tell it not to. Delusional ramblings that resemble working code >_<
Oh, so I’m not imagining this. Recently, I’ve tried to up my LLM usage to try and learn to use the tooling better. However, I’ve seen this happen with enough frequency that I’m just utterly frustrated with LLMs. Guess I should use Claude less and others more.
I’ve observed this consistently.
It’s scary how easy it is to fool these models, and how often they just confuse themselves and confidently march forward with complete bullshit.
Everything to do with LLM prompts reminds me of people doing regexes to try and sanitise input against SQL injections a few decades ago, just papering over the flaw but without any guarantees.
It's weird seeing people just add a few more "REALLY REALLY REALLY REALLY DON'T DO THAT" lines to the prompt and hope. To me it's just an unacceptable risk, and any system using these needs to treat the entire LLM as untrusted the second you put any user input into the prompt.