I don't really get this. At this point, my limiting factor is not how quickly Claude can self-trudge through code. It's whether Claude is going to do the task correctly or not.
I need more mechanisms for controlling long-running sessions and dynamically injecting my thoughts, corrections, and nudges, rather than faster ways to burn through my tokens without knowing if the results are going to be correct.
This is my experience. Quantity of output is not the issue right now. Quality is. But I’m not sure if this will ever be solved, given that LLMs are non-deterministic, sophisticated autocomplete at their core.
Sure, ‘human in the loop’ and all that jazz, but I feel like my knowledge suffers even with this approach. I have to use LLMs with pinpoint focus to get decent results.
The original Copilot completions behavior might be peak LLM performance for coding, sans having an agent write boilerplate and such.
When this is all said and done, these coding models will allow you to rewrite the Linux kernel in Rust, recode Kubernetes in assembly, and create your own web framework in 10 minutes.
But each prompt will cost your company 10 to 15 million dollars. An extra 20 million if you ask them to review the code and improve the comments.
I think for now it's better to convert tokens into code/library code and then work with that for deterministic results, rather than relying on Claude being correct.
Yes, I agree with this. More granular rewinding, letting me interrupt where it went off the rails, or even editing file reads myself, etc., would be lovely. Ingesting parts of other conversations would also be cool!
I have heard of "token-maxxing" but I have not heard of "correctness-maxxing" or "quality-maxxing".
The answer for me has actually been more tokens, and creating even more layers of automated verification.
Dynamic workflows, in my experience, make Claude more effective at complex long-running tasks. They help precisely with getting Claude to do the task correctly.
It feels more like a bespoke build system for the specific task/project than prompting a freeform chat.
I think the theoretical answer here is this:
"Agents address the problem from independent angles, other agents try to refute what they found, and the run keeps iterating until the answers converge."
So you supply the "ground truth" (test suite, detailed spec, whatever) and empower an agent to use it to guide the other agents. Currently, a lot of people do this sequentially, in the form of multiple code-review passes by fresh agent sessions looking at the work of previous sessions.
Adversarial models are a longstanding technique in ML, so it makes sense they would try to go this way.
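To make the propose/refute/converge idea concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the "agents" are canned functions standing in for LLM sessions, the ground truth is a three-case test suite for absolute value, and "converged" is taken to mean that every candidate survives the refuter in the same round. The names `refute`, `make_proposer`, and `run_until_converged` are hypothetical, not any real tool's API.

```python
from typing import Callable, List, Optional, Tuple

Candidate = Callable[[int], int]
TestCase = Tuple[int, int]  # (input, expected output)

# Ground truth supplied by the human: a tiny test suite for "absolute value".
GROUND_TRUTH: List[TestCase] = [(3, 3), (-4, 4), (0, 0)]


def refute(candidate: Candidate, tests: List[TestCase]) -> Optional[TestCase]:
    """Refuter agent: return the first failing test case, or None if none fails."""
    for arg, expected in tests:
        if candidate(arg) != expected:
            return (arg, expected)
    return None


def make_proposer(attempts: List[Candidate]) -> Callable[[Optional[TestCase]], Candidate]:
    """Hypothetical stand-in for an LLM agent: replays canned attempts,
    moving to the next one each time it is handed a counterexample."""
    state = {"i": 0}

    def propose(counterexample: Optional[TestCase]) -> Candidate:
        if counterexample is not None and state["i"] < len(attempts) - 1:
            state["i"] += 1
        return attempts[state["i"]]

    return propose


def run_until_converged(proposers, tests, max_rounds: int = 5):
    """Iterate propose -> refute until every candidate survives the ground truth."""
    feedback: List[Optional[TestCase]] = [None] * len(proposers)
    for round_no in range(1, max_rounds + 1):
        candidates = [p(fb) for p, fb in zip(proposers, feedback)]
        feedback = [refute(c, tests) for c in candidates]
        if all(fb is None for fb in feedback):
            return candidates, round_no
    raise RuntimeError("agents did not converge within max_rounds")


# Two agents approaching the task from different (initially wrong) angles.
agent_a = make_proposer([lambda x: x, lambda x: -x if x < 0 else x])
agent_b = make_proposer([lambda x: -x, lambda x: x * x, lambda x: abs(x)])

candidates, rounds = run_until_converged([agent_a, agent_b], GROUND_TRUTH)
print(f"converged after {rounds} rounds")
```

The point of the sketch is that the loop itself is deterministic and cheap; only the proposers would be non-deterministic in practice, and the hard cap on rounds keeps a run that never converges from burning tokens indefinitely.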