Seems like this is an aspect of their well-known overconfidence and their inability to self-reflect and recognize that they need to ask for more details because their priors are too low. If you look at the output of reasoning models, it's clear that the idea of asking for clarification very rarely occurs to them; when they're confused, it's just endless speculation about what the user might have meant.
This, of course, has implications for the wisdom of "replacing human programmers", given that one of the hard parts of the trade is turning vague and often confused ideas into precise specifications by interacting with the stakeholders.
It's amazing that branching/forking isn't a core aspect of the main chat tools.
You can edit responses, sure, but then a bunch of other context is lost.
My flow is basically:
1. plan
2. build
3. branch (into some feature/esoteric dependency issue)
4. goto #2
Prompt pruning/branching should be a first-class tool for any LLM usage.
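Conceptually it's simple: the conversation is just a list of messages, so a branch is a copy taken at the fork point. A minimal sketch in Python (the `Conversation` class and its methods are mine for illustration, not any chat tool's API):

```python
import copy

class Conversation:
    """Toy conversation holder; a branch is just a deep copy of the history."""

    def __init__(self, messages=None):
        self.messages = messages or []

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})

    def branch(self):
        # Fork the history: the branch shares nothing mutable with the parent,
        # so exploring a tangent can't pollute the original context.
        return Conversation(copy.deepcopy(self.messages))

# Usage: plan/build on `main`, fork to chase a dependency issue, then resume `main`.
main = Conversation()
main.add("user", "Plan the feature.")
tangent = main.branch()
tangent.add("user", "Why does this esoteric dependency fail to build?")
# `main.messages` is untouched by anything appended to `tangent`.
```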
Kind of wild how even the best models still struggle with keeping context straight over time. Definitely feels like a big challenge if we want these things to hold real conversations.
I always felt the derision around the term "prompt engineering" was partially due to people overestimating the importance of the initial prompt and underestimating the importance of managing the ongoing context.
You develop a knack through experience for how to steer the models and when to start a new conversation. The system and initial prompts are important, but nothing will save you if you naively keep a conversation going too long.
I've been working on solving this with quite a bit of success; I'll be sharing more on it soon. It involves two systems: the first is the LLM itself, and the second acts as a 'curator' of thoughts, you could say.
The curator dynamically swaps portions of the context in and out. It isn't based on explicit definitions; it relies on the LLM 'filling the gaps'. It helps the LLM break problems down into small tasks which eventually aggregate into the full task.
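For a rough idea of the shape of such a curator (this is my own sketch of the general pattern, not the poster's system; `score_relevance` is a hypothetical hook you would replace with embedding similarity or a cheap LLM call):

```python
def score_relevance(chunk: str, task: str) -> float:
    # Placeholder scoring based on word overlap; swap in embeddings or an LLM judge.
    shared = set(chunk.lower().split()) & set(task.lower().split())
    return len(shared) / (len(task.split()) + 1)

def curate_context(memory: list[str], task: str, budget_chars: int = 4000) -> list[str]:
    """Pick the highest-scoring stored chunks that fit the context budget."""
    ranked = sorted(memory, key=lambda c: score_relevance(c, task), reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        if used + len(chunk) > budget_chars:
            continue
        selected.append(chunk)
        used += len(chunk)
    return selected

# The curated chunks are prepended to the prompt for the current sub-task, while
# everything else stays out of the model's context until it becomes relevant.
```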
Which is why I came up with TSCE (Two-Step Contextual Enrichment).
+30pp uplift when using GPT-3.5-turbo on a mix of 300 tasks.
Free, open framework; check out the repo and try it yourself:
https://github.com/AutomationOptimization/tsce_demo
I tested this another 300 times with gpt-4.1 to remove those obtrusive "em-dashes" everyone hates. I compared a single-pass baseline against TSCE with the exact same instructions and prompt: "Remove the em-dashes from my linkedin post. . .".
Out of the 300 tests, baseline failed to remove the em-dashes 149/300 times. TSCE failed to remove the em-dashes 18/300 times.
It works; all the data, as well as the entire script used for testing, is in the repo.
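For readers who want the shape of the comparison without digging into the repo, here is a rough sketch of a single-pass vs. two-step harness. This is my paraphrase of "two-step contextual enrichment", not the repo's actual code or prompts; it assumes the openai>=1.0 Python SDK and an API key in the environment:

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4.1"
TASK = "Remove the em-dashes from my linkedin post. . ."

def single_pass(post: str) -> str:
    # Baseline: one call with the task and the post.
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": f"{TASK}\n\n{post}"}],
    )
    return resp.choices[0].message.content

def two_step(post: str) -> str:
    # Step 1: produce an enriched scratch context for the task
    # (my paraphrase of the enrichment step, not the repo's exact prompt).
    enriched = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user",
                   "content": f"Analyze this task and restate its constraints precisely:\n{TASK}"}],
    ).choices[0].message.content
    # Step 2: answer with the enriched context prepended.
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": enriched},
            {"role": "user", "content": f"{TASK}\n\n{post}"},
        ],
    )
    return resp.choices[0].message.content

def failed(output: str) -> bool:
    # The pass/fail check from the experiment: any remaining em-dash counts as a failure.
    return "\u2014" in output
```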
I feel like at this point the LLM space is just filled with people solving and re-solving the same problems over and over.
I'd like to see more research on context understanding beyond NIAH (needle in a haystack). I don't believe LLMs really support the context lengths companies claim, but I need to know the actual limits to use the tools effectively, at least for coding.
Stuff like this:
1. Do: Best practice for X model is to include at max 10k lines of code + task + CONVENTIONS.md + architecture guidance. Only queue tasks for components that are fairly decoupled from the rest of the codebase (e.g. small modules).
2. Don't: Start a project without a clearly defined architecture in this format. Don't ask for tasks that require X amount of reading hops to understand the logic.
I find it frustrating that companies release their benchmaxxing without helping developers actually use their models. It's all the more ironic that some people think of these AIs as employees. Employees can work with their boss on the best way to achieve things! With LLMs you don't even know how to communicate with them, and as a result their output is unreliable.
That's no surprise. When I was working on game theory and agent reasoning, I reached the same conclusion a year ago.
My conclusion was that context needs to be managed well for LLMs to maintain accuracy in their replies. It also helps to have a planning process ("graph reasoning") before task execution, because it guardrails the model's thought process.
This also raises the discussion of general-use versus workflow-specific agent implementations, since in the former it is much more difficult to generalize all the components needed to structure effective ReAct patterns.
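As a rough illustration of what "plan first, then execute" buys you (my own minimal sketch, not the commenter's implementation; the model name and prompts are placeholders, using the openai Python SDK):

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4.1"  # placeholder; any chat model works for the sketch

def make_plan(task: str) -> list[str]:
    # One call produces an explicit step list that later acts as a guardrail.
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user",
                   "content": f"Break this task into short numbered steps, one per line:\n{task}"}],
    )
    return [line.strip() for line in resp.choices[0].message.content.splitlines() if line.strip()]

def execute(task: str) -> list[str]:
    plan = make_plan(task)
    results = []
    for step in plan:
        # Each step sees only the plan and prior step results, not a long
        # meandering conversation history.
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[
                {"role": "system",
                 "content": f"Overall task: {task}\nPlan:\n" + "\n".join(plan)},
                {"role": "user",
                 "content": f"Execute this step and report the result:\n{step}\n\n"
                            "Prior results:\n" + "\n".join(results)},
            ],
        )
        results.append(resp.choices[0].message.content)
    return results
```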
This is the best paper on machine psychology [1] I’ve yet seen. Rigorous, empirical, insightful — and very practical.
[1] http://ui.adsabs.harvard.edu/abs/2023arXiv230313988H/abstrac...
Exactly why expert steering should be valued.
My take: multi-turn evals are hard because, to do them really correctly, you have to simulate a user. That is not yet modeled well enough for multi-turn to work as well as it could.
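The bare mechanics of a simulated-user eval are easy to sketch; the hard, unsolved part is making the simulated user behave like a real one. A minimal outline (names and prompts are mine, not any benchmark's; assumes the openai Python SDK):

```python
from openai import OpenAI

client = OpenAI()
ASSISTANT_MODEL = "gpt-4.1"   # model under test (placeholder)
USER_SIM_MODEL = "gpt-4.1"    # model playing the user (placeholder)

def chat(model: str, messages: list[dict]) -> str:
    return client.chat.completions.create(model=model, messages=messages).choices[0].message.content

def run_episode(hidden_goal: str, opening: str, turns: int = 4) -> list[dict]:
    # The simulated user holds a hidden goal and replies to the assistant under test.
    history = [{"role": "user", "content": opening}]
    for _ in range(turns):
        assistant_reply = chat(ASSISTANT_MODEL, history)
        history.append({"role": "assistant", "content": assistant_reply})
        user_reply = chat(USER_SIM_MODEL, [
            {"role": "system",
             "content": f"You are simulating a user whose real goal is: {hidden_goal}. "
                        "Reveal details gradually and imprecisely, the way a real user would."},
            {"role": "user",
             "content": "Transcript so far:\n"
                        + "\n".join(f"{m['role']}: {m['content']}" for m in history)
                        + "\nWrite the user's next message only."},
        ])
        history.append({"role": "user", "content": user_reply})
    return history  # score the final transcript however your eval requires
```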
I've seen a local deepseek-coder get into an infinite loop, generating the same line over and over, which I assume (without evidence) is some sort of feedback from the generated line back into the generation process. So it kind of got lost in thought and went off topic from the simple .h API that my prompt asked for.
Have you seen a bunch of humans in a room?
Humans also often get lost in multi-turn conversation.
I have experienced that in person many, many times. Jumps in context that seem easy for one person to follow, but very hard for others.
So, assuming the paper is legit (arXiv, you never know...), it's more like something that could be improved than a difference from human beings.
It's nice to see a paper that confirms what anyone who has practiced using LLM tools already knows very well, heuristically: keeping your context clean matters; "conversations" are only a construct of product interfaces, and they hurt the quality of responses from the LLM itself; and once your context is "poisoned" it will not recover, so you need to start fresh with a new chat.
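In practice, "start fresh" usually means carrying over a distilled brief rather than literally nothing. A small sketch of that habit, entirely my own illustration (prompt wording and model name are placeholders; openai Python SDK assumed):

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4.1"  # placeholder

def distill(history: list[dict]) -> str:
    # Compress the poisoned conversation into a short brief, dropping dead ends.
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user",
                   "content": "Summarize the requirements and decisions below as a short, "
                              "precise brief. Drop dead ends and wrong turns.\n\n" + transcript}],
    )
    return resp.choices[0].message.content

def start_fresh(history: list[dict], next_request: str) -> list[dict]:
    # New conversation seeded only with the distilled brief plus the next concrete ask.
    return [
        {"role": "system", "content": distill(history)},
        {"role": "user", "content": next_request},
    ]
```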