This reflects my experience. Yet I find that getting reliable behavior out of LLM calls in a while-loop harness is elusive.
For example:
- how can I reliably have a decision block to end the loop (or keep it running)?
- how can I reliably call tools with the right schema?
- how can I reliably summarize context / excise noise from the conversation?
Perhaps, as the models get better, they'll approach some threshold where my worries just go away. However, I can't quantify that threshold myself, and that leaves a cloud of uncertainty hanging over any agentic loops I build.
Perhaps I should accept that it's a feature and not a bug? :)
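To make the questions concrete, this is roughly the shape of loop I mean (a minimal sketch; the model name is just an example, and the commented spots are exactly the three parts I don't know how to make reliable):

```python
import json
from openai import OpenAI

client = OpenAI()

def run_agent(task: str, tools: list[dict], tool_impls: dict, max_turns: int = 20):
    """Bare-bones agentic loop: call the model, run any requested tools,
    and stop when the model stops asking for tools (or we hit max_turns)."""
    messages = [{"role": "user", "content": task}]

    for _ in range(max_turns):
        resp = client.chat.completions.create(
            model="gpt-4o",        # any tool-calling model
            messages=messages,
            tools=tools,
        )
        msg = resp.choices[0].message
        messages.append(msg)

        # (1) Decision block: here it's just "no tool calls means we're done".
        if not msg.tool_calls:
            return msg.content

        # (2) Tool calls: trust the model to have produced valid arguments.
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = tool_impls[call.function.name](**args)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            })

        # (3) Context: nothing here yet -- the history just grows every turn.

    return "gave up after max_turns"
```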
Re (1): use a TODO system like Claude Code does.
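I.e. give the model a tool for checking items off an explicit TODO list, and let the harness, not the model's free-form judgment, decide when to stop. A rough sketch (not Claude Code's actual implementation; the names are made up):

```python
from dataclasses import dataclass

@dataclass
class TodoItem:
    description: str
    done: bool = False

# The model is given a tool to update this list; the harness, not the model,
# decides whether to keep looping.
todos = [TodoItem("read the failing test"), TodoItem("propose a fix")]

def update_todo(index: int, done: bool) -> str:
    """Tool the model calls to mark an item complete (or reopen it)."""
    todos[index].done = done
    return f"{sum(t.done for t in todos)}/{len(todos)} items done"

def should_stop() -> bool:
    """The loop's exit condition is a plain check, not a model judgment."""
    return all(t.done for t in todos)
```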
Re (3): also fairly easy! It's just a summarization prompt. E.g. this is the one we use in our agent: https://github.com/HolmesGPT/holmesgpt/blob/62c3898e4efae69b...
Or just use the Claude Code SDK, which does all of this for you! (You can also use various provider-specific features for (3), like automatic compaction on the OpenAI Responses endpoint.)
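Mechanically it's just: when the history gets too long, replace the older messages with a model-written summary. Something like this (the prompt wording, thresholds, and model choice are made up for illustration; this is not the linked HolmesGPT prompt):

```python
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("o200k_base")

SUMMARIZE_PROMPT = (
    "Summarize the conversation so far for an AI agent that will continue the "
    "task. Keep file paths, commands run, errors seen, and open questions. "
    "Drop chit-chat and redundant tool output."
)

def maybe_compact(messages: list[dict], budget: int = 60_000) -> list[dict]:
    """If the history exceeds the token budget, summarize everything except
    the system message and the last few turns, then splice the summary in."""
    used = sum(len(enc.encode(str(m.get("content", "")))) for m in messages)
    if used < budget or len(messages) <= 6:
        return messages

    head, tail = messages[1:-4], messages[-4:]   # keep system msg + recent turns
    summary = client.chat.completions.create(
        model="gpt-4o-mini",   # any cheap model works for compaction
        messages=[{"role": "system", "content": SUMMARIZE_PROMPT},
                  {"role": "user", "content": str(head)}],
    ).choices[0].message.content

    return [messages[0],
            {"role": "user", "content": f"[Conversation summary]\n{summary}"},
            *tail]
```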
Forgot to address the easiest part:
> - how can I reliably call tools with the right schema?
This is typically done by enabling strict mode for tool calling, which is a hermetic solution: it makes the LLM unable to generate tokens that would violate the schema. (I.e. the LLM samples only from the subset of tokens that lead to a schema-valid output.)
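With the OpenAI API, for example, this is the strict flag on the function definition (strict mode requires the restricted JSON Schema subset: every property listed in required, additionalProperties set to false). The tool itself is just an example:

```python
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "strict": True,                    # constrained decoding: arguments must match the schema
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city", "unit"],      # strict mode: list every key...
            "additionalProperties": False,     # ...and forbid extras
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Oslo, in celsius?"}],
    tools=tools,
)
# The arguments string is guaranteed to parse and match the schema.
print(resp.choices[0].message.tool_calls[0].function.arguments)
```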