It's nice to see a paper that confirms what anyone who has practiced using LLM tools already knows very well, heuristically. Keeping your context clean matters; "conversations" are only a construct of product interfaces, and they hurt the quality of responses from the LLM itself; and once your context is "poisoned" it will not recover, so you need to start fresh with a new chat.
This matches my experience exactly. "Poisoned" is a great way to put it. I find that once something has gone wrong, all subsequent responses are bad. This is why I am iffy on ChatGPT's memory features. I don't notice it causing any huge problems, but I don't love how it pollutes my context in ways I don't fully understand.
I've been saying for ages that I want to be able to fork conversations, so I can experiment with the direction an exchange takes without irrevocably poisoning a promising well. I can't do this with ChatGPT; is anyone aware of a provider that offers this as a feature?
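For what it's worth, nothing stops you from rolling this yourself against the raw API, since a "conversation" is just a list of messages the client resends on every call. A minimal sketch, assuming the official OpenAI Python client (the model name and prompts are placeholders, not a real feature of any chat product):

    from openai import OpenAI

    client = OpenAI()

    def ask(messages):
        # The API is stateless: the full history is resent on every call.
        resp = client.chat.completions.create(model="gpt-4o", messages=messages)
        return messages + [{"role": "assistant",
                            "content": resp.choices[0].message.content}]

    trunk = ask([{"role": "user", "content": "Outline an approach to X."}])

    # "Forking" is just copying the history and diverging from the copy;
    # the trunk stays untouched, so a bad branch can't poison it.
    branch_a = ask(list(trunk) + [{"role": "user", "content": "Try a recursive design."}])
    branch_b = ask(list(trunk) + [{"role": "user", "content": "Try an iterative design."}])

The chat UIs just don't expose this; the model never even knows a fork happened.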
The #1 tip I teach is to make extensive use of the teeny-tiny mostly hidden “edit” button in ChatGPT and Claude. When you get a bad response, stop and edit to get a better one, rather than letting crap start to multiply crap.
An interesting little example of this problem is initial prompting, which is effectively just a permanent, hidden context that can't be cleared. On Twitter right now, the "Grok" bot has recently begun frequently mentioning "White Genocide," which is, y'know, odd. This is almost certainly because someone recently adjusted its prompt to tell it what its views on white genocide are meant to be, which for a perfect chatbot wouldn't matter when you ask it about other topics, but it DOES matter. It's part of the context. It's gonna talk about that now.
Has any interface implemented a history-cleaning mechanism? I.e., with every chat message, focus on cleaning up dead ends in the conversation or irrelevant details. Like summarization, but organic to the topic at hand?
Most of the history would remain; it wouldn't try to summarize exactly, just prune and organize the history relative to the conversation path?
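One hedged way this could work: a second, out-of-band model pass that filters the transcript rather than summarizing it. A rough sketch, assuming the OpenAI Python client (the pruning prompt and model name are illustrative, not taken from any existing product):

    from openai import OpenAI

    client = OpenAI()

    PRUNE_PROMPT = (
        "Below is a chat transcript. Drop turns that are dead ends or no longer "
        "relevant to the current topic. Do not summarize or reword the turns you "
        "keep; return them verbatim, in order."
    )

    def prune_history(messages):
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "system", "content": PRUNE_PROMPT},
                      {"role": "user", "content": transcript}],
        )
        # The pruned transcript would then be parsed back into a message
        # list and used as the context for the next turn.
        return resp.choices[0].message.content

The key property is that the cleanup pass happens in its own throwaway context; only its output ever touches the main conversation.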
This is why I created FileKitty, which lets you quickly concatenate multiple source code files into markdown-formatted copy-pasta:
https://github.com/banagale/FileKitty
When getting software development assistance, relying on LLM products to search code bases, etc., leaves too much room for error. Throw in what amounts to lossy compression of that context to save the service provider on token costs, and the LLM is serving watered-down results.
Getting the specific context right up front and updating that context as the conversation unfolds leads to superior results.
Even then, you do need to mind the length of conversations. I have a prompt designed to capture conversational context, and transfer it into a new session. It identifies files that should be included in the new initial prompt, etc.
For a bit more discussion on this, see this thread and its ancestry: https://news.ycombinator.com/item?id=43711216
Agreed, "poisoned" is a good term. I'd like to see "version control" for conversations via the API and UI that lets you roll back to a previous point or clone from that spot into a new conversation. Even a typo, or having to clarify a previous message, accidentally skews the probabilities of future responses.
Yep. I regretted leaving memory on, as it poisoned my conversations with irrelevant junk.
I agree: once the context is "poisoned," it's tough to recover. A potential improvement could be having the LLM periodically clean or reset certain parts of the context without starting from scratch. However, the challenge would be determining which parts of the context need resetting without losing essential information. Smarter context management could help maintain coherence in longer conversations, but it's a tricky balance to strike. Perhaps using another agent to do the job?
I mostly just use LLMs for autocomplete (not chat), but wouldn’t this be fixed by adding a “delete message” button/context option in LLM chat UIs?
If you delete the last message from the LLM (so now, you sent the last message), it would then generate a new response. (This would be particularly useful with high-temperature/more “randomly” configured LLMs.)
If you delete any other message, it just updates the LLM context for any future responses it sends (the real problem at hand, context cleanup).
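The mechanics are simple, since a chat UI just replays a message list. A toy sketch covering both cases, assuming the OpenAI Python client (nothing here is an existing UI feature):

    from openai import OpenAI

    client = OpenAI()

    def delete_message(history, index, model="gpt-4o"):
        # Deleting a turn just means never resending it: the API is stateless.
        history = history[:index] + history[index + 1:]
        # If we deleted the assistant's final reply, a user message is now
        # last, so regenerate; a higher temperature varies the retry.
        if history and history[-1]["role"] == "user":
            resp = client.chat.completions.create(
                model=model, messages=history, temperature=1.0)
            history.append({"role": "assistant",
                            "content": resp.choices[0].message.content})
        return history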
I think seeing it work this way would also really help end users who think LLMs are “intelligent” to better understand that it’s just a big, complex autocomplete (and that’s still very useful).
Maybe this is standard already, or used in some LLM UI? If not, consider this comment as putting it in the public domain.
Now that I’m thinking about it, it seems like it might be practical to use “sub-contextual LLMs” to manage the context of your main LLM chat. Basically, if an LLM response in your chat/context is very long, you could ask the “sub-contextual LLM” to shorten/summarize that response, thus trimming down/cleaning the context for your overall conversation. (Also, more simply, an “edit message” button could do the same, just with you, the human, editing the context instead of an LLM…)
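A hedged sketch of that sub-contextual idea: compress a single oversized turn in place, with the summarization happening out-of-band so the sub-LLM's own exchange never enters the main context (the model name and prompt are made up for illustration):

    from openai import OpenAI

    client = OpenAI()

    def compress_turn(history, index, max_words=150):
        # Summarize one long message in a separate, throwaway context.
        long_text = history[index]["content"]
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content":
                       f"Summarize in at most {max_words} words, keeping all "
                       f"decisions and code identifiers intact:\n\n{long_text}"}],
        )
        # Splice the summary back in; every later turn sees the short version.
        history[index] = {"role": history[index]["role"],
                          "content": resp.choices[0].message.content}
        return history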
I suppose that the chain-of-thought style of prompting that is used by AI chat applications internally also breaks down because of this phenomenon.
Weirdly, it has gotten to the point that I have embedded this into my workflow and will often prompt:
> "Good work so far, now I want to take it to another step (somewhat related but feeling it too hard): <short description>. Do you think we can do it in this conversation or is it better to start fresh? If so, prepare an initial prompt for your next fresh instantiation."
Sometimes the model says that it might be better to start fresh, and prepares a good summary prompt (including a final 'see you later'), whereas in other cases it assures me it can continue.
I have a lot of notebooks with "initial prompts to explore forward". But given the sycophancy going on as well as one-step RL (sigh) post-training [1], it indeed seems AI platforms would like to keep the conversation going.
[1] RL in post-training has little to do with real RL; it just uses one-shot preference mechanisms with an RL-inspired training loop. There is very little work on long-term preferences/conversations, as that would increase requirements exponentially.
>"conversations" are only a construct of product interfaces
This seems to be in flux now due to RL training on multi-turn eval datasets, so while the context window is evergreen every time, there will be some bias towards interpreting each prompt as part of a longer conversation. Multi-turn post-training is not scaled out in public yet, but I think it may be the way to stay on the 'double time spent on goal every 7 months' curve.
Yes, even when coding rather than conversing, I often start new conversations where I take the current code and explain it anew. This often gives better results than hammering on one conversation.
This feels like something that can be fixed with manual instructions which prompt the model to summarize and forget. This might even map appropriately to human psychology: working memory vs. narrative/episodic memory.
Which is why I really like Zed's chat UX: being able to edit the full prior conversation like a text file, I can go back and clean it up, make small adjustments, delete turns, etc., and then continue the discussion with a cleaner and more relevant context.
I have made Zed one of my main LLM chat interfaces, even for non-programming tasks, because being able to do that is great.
One of the most frustrating features of ChatGPT is "memories," which can cause that poisoning to follow you around between chats.
Yarp! And "poisoning" can be done with "off-topic" questions and answers as well as just sort of "dilution". Have noticed this when doing content generation repeatedly, tight instructions get diluted over time.
" 'conversations' are only a construct of product interface" is so helpful maintain top-of-mind, but difficult because of all the "conversational" cues
What surprised me is how early the models start locking into wrong assumptions.
And now that ChatGPT has a "memory" and can access previous conversations, it might be poisoned permanently. It gets one really bad idea, and forever after it insists on dumping that bad idea into every subsequent response, even after you repeatedly tell it "THAT'S A SHIT IDEA, DON'T EVER MENTION THAT AGAIN". Sometimes it'll accidentally include some of its internal prompting ("user is very unhappy, make sure to not include xyz"), and then it'll give you a response that is entirely focused on xyz.
My experiences somewhat confirm these observations, but I also had one that was different. Two weeks of debugging IPsec issues with Gemini. Initially, I imported all the IPsec documentation from OPNsense and pfSense into Gemini and informed it of the general context in which I was operating (in reference to 'keeping your context clean'). Then I added my initial settings for both sides (sensitive information redacted!). Afterwards, I entered a long feedback loop, posting logs and asking and answering questions.
At the end of the two weeks, I observed that the LLM was much less likely to become distracted. Sometimes I would dump whole forum threads or SO posts into it, and it would say "this is not what we are seeing here, because of [earlier context or finding]". I eliminated all dead ends logically and informed it of this (yes, it can help with the reflection, but I had to make the decisions). In the end, I found the cause of my issues.
This somewhat confirms what some user here on HN said a few days ago: LLMs are good at compressing complex information into simpler form, but not at expanding simple ideas into complex ones. As long as my input was larger than the output (in either complexity or length), I was happy with the results.
I could have done this without the LLM. However, it was helpful in that it stored facts from the outset that I had either forgotten or been unable to retrieve quickly in new contexts. It also made it easier to identify time patterns in large log files, which helped me debug my site-to-site connection. I also optimized many other settings along the way, resolving more than just the most problematic issue. So in addition to fixing my problem, I learned quite a bit. The 'state' was only occasionally incorrect about my current parameter settings, and this was always easy to correct. This confirms what others have already seen: if you know where you are going and treat it as a tool, it is helpful. However, don't try to offload decisions to it or let it direct you in the wrong direction.
Overall, about 350k tokens were used (roughly 300k words). Here's a related blog post [1] with my overall path, though it doesn't directly correspond to this specific issue. (Please don't recommend WireGuard; I am aware of it.)