A lot of models have also been overly chat-trained, responding with stuff like “Sure, I can help you with that.”
That’s just unwanted noise if you’re trying to use them as a building block in an application, so you need to force JSON or similar… which I suspect harms accuracy over free form
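For concreteness, here's roughly what "forcing JSON" looks like with the OpenAI Python SDK (the model name and the schema in the prompt are just placeholders, not a recommendation):

```python
from openai import OpenAI

client = OpenAI()

# response_format pins the reply to a valid JSON object; the prompt still has
# to mention JSON explicitly for this mode to be accepted.
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Reply with a JSON object only, no prose."},
        {"role": "user", "content": 'Summarize this diff as {"summary": ..., "risk": ...}: ...'},
    ],
)
print(resp.choices[0].message.content)
```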
> which I suspect harms accuracy over free form
Untrue in my testing. If you want to use chain of thought, you can always throw in a `thoughts` field (as a JSON field or XML tags) before the rest of your output.
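A sketch of that pattern, with illustrative field names and a placeholder `call_llm` client: the `thoughts` key comes first, so the model still gets room to reason inside the structured output, and you just drop it before using the result.

```python
import json

# Hypothetical prompt: `thoughts` comes before the answer fields so the model
# can reason before committing to them.
prompt = """Return a JSON object with exactly these keys, in this order:
  "thoughts": your step-by-step reasoning,
  "answer": the final answer,
  "confidence": a number from 0 to 1.
"""

raw = call_llm(prompt)       # call_llm is a placeholder for whatever client you use
data = json.loads(raw)
data.pop("thoughts", None)   # discard the reasoning, keep the answer
```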
This isn’t a problem in practice. Most of my prompts ask the LLM to do a bunch of chain of thought before asking it to spit out JSON. I extract the JSON, which works 97.5% of the time, and have a retry step that gets real specific about “here’s the conversation so far, but I need JSON now” to handle the rest. Adding examples really helps.
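Roughly what that extract-then-retry step can look like; the regex extraction, the retry wording, and `call_llm` are all just one way to do it:

```python
import json
import re

def extract_json(text: str):
    """Pull the first {...} block out of a free-form LLM reply, or None."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

def ask_for_json(call_llm, prompt: str, max_retries: int = 1):
    reply = call_llm(prompt)
    parsed = extract_json(reply)
    for _ in range(max_retries):
        if parsed is not None:
            break
        # Retry: hand back the conversation so far and insist on JSON only.
        retry_prompt = (
            f"{prompt}\n\nYour previous reply:\n{reply}\n\n"
            "That wasn't valid JSON. Respond again with only the JSON object."
        )
        reply = call_llm(retry_prompt)
        parsed = extract_json(reply)
    return parsed
```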
I also firmly believe that the number of tokens served is a metric that is tracked and encouraged to go up, because more tokens mean more charges. o1 "does more" by using a whole lot more tokens for a very slight bump in usefulness.
I've not had that experience when I include "only respond with the code" in the prompt for a coding LLM.
Though it's worth noting that I often do want an explanation, and currently my workflow is to just use a different LLM.
Unfortunately, that "unwanted noise" is a space for the models to compute; trying to eliminate it gives suboptimal responses. What you can do instead is try to corral it - let the model "think" like it wants, but guide it to add markers wrapping the thinking and/or result, then filter out the thinking in UI (for interactive applications) or as an intermediate/post-processing step (for hidden "building blocks").
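One way to do that corralling (the tag names and prompt wording here are arbitrary): let the model think inside a marker, then strip the marker out before the result reaches the rest of the pipeline.

```python
import re

SYSTEM_PROMPT = (
    "Think through the problem inside <scratchpad>...</scratchpad>, "
    "then put only the final result inside <result>...</result>."
)

def strip_thinking(reply: str) -> str:
    """Keep the <result> block if present; otherwise drop the scratchpad."""
    result = re.search(r"<result>(.*?)</result>", reply, re.DOTALL)
    if result:
        return result.group(1).strip()
    # Fallback: remove the scratchpad and return whatever is left.
    return re.sub(r"<scratchpad>.*?</scratchpad>", "", reply, flags=re.DOTALL).strip()
```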
If you're using Anthropic models, you may actually get improvements from prompting the model to maintain a tagging discipline; see https://docs.anthropic.com/en/docs/build-with-claude/prompt-....
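For what it's worth, with the Anthropic Python SDK that tagging discipline can be as simple as asking for it in the system prompt (model name is a placeholder, swap in whatever you're actually running):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model
    max_tokens=1024,
    system="Wrap any reasoning in <thinking> tags and the final answer in <answer> tags.",
    messages=[{"role": "user", "content": "Refactor this function to remove duplication: ..."}],
)
print(message.content[0].text)
```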