I have heard this argument before, but never actually seen concrete evals.
The argument goes that because we are intentionally constraining the model, we get less creativity. I believe OAI's method is a softmax (I think; rusty on my ML math) to get the tokens sorted by probability, then taking the first one that the current state machine allows.
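Roughly, the mechanism as I understand it would look something like this (a sketch only; `constrained_greedy_pick` and `is_allowed` are names I made up, with `is_allowed` standing in for whatever grammar/state-machine check the provider actually runs):

    import numpy as np

    def constrained_greedy_pick(logits, is_allowed):
        # Softmax the logits, walk token ids in descending probability,
        # and take the first one the grammar's state machine accepts.
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        for token_id in np.argsort(probs)[::-1]:
            if is_allowed(token_id):
                return token_id
        raise ValueError("state machine rejected every token")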
Maybe, but a one-off vibes example is hardly proof. I still use structured output regularly.
Oh, and tool calling is almost certainly implemented atop structured output. After all, it's forcing the model to respond with JSON matching a schema that represents the tool arguments. I struggle to believe that this is adequate for tool calling but inadequate for general-purpose use.
> but never actually seen concrete evals.
The team behind the Outlines library has produced several sets of evals and repeatedly shown the opposite: that constrained decoding improves model performance (including examples with "CoT", which the post claims isn't possible) [0][1].
There was a paper that claimed constrained decoding hurt performance, but it had some fundamental errors, which they also wrote about [2].
People get weirdly superstitious when it comes to constrained decoding, as though it's somehow "limiting the model," when it's as simple as applying a conditional probability distribution to the logits. I also suspect this post is largely there to justify the fact that BAML parses the results (since the post is written by them).
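To be concrete, "conditional probability distribution" just means something like this (a sketch, not any particular library's implementation; `constrained_distribution` and `allowed_ids` are hypothetical names):

    import numpy as np

    def constrained_distribution(logits, allowed_ids):
        # Mask the logits of tokens the grammar forbids, then softmax
        # what's left. Sampling from this is P(token | token is
        # grammatical) -- nothing about the model itself is altered.
        masked = np.full_like(logits, -np.inf)
        masked[allowed_ids] = logits[allowed_ids]
        exp = np.exp(masked - masked[allowed_ids].max())
        return exp / exp.sum()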
0. https://blog.dottxt.ai/performance-gsm8k.html
1. https://blog.dottxt.ai/oss-v-gpt4.html
2. https://blog.dottxt.ai/say-what-you-mean.html