Does anyone have more benchmarks or evals with data on this topic? The claimed 20% accuracy reduction is significant.
Structured output was one of those lesser-known topics that AI consultants and course writers got a lot of mileage out of because it felt like magic. A lot of management people would use ChatGPT but didn't know how to bridge free-text output into a familiar API format, so a trick that turned it into JSON felt like the missing link. Now that I think about it, though, I don't recall seeing any of that content actually evaluate the impact of constrained output on quality.
This blog post blurs the line between output-quality reduction and incorrect error handling, though. I'd like to see more thorough benchmarking that separates obvious schema mistakes from the quality-reduction measurements.
(Repeating an earlier comment.) The team behind Outlines has repeatedly published evaluations showing that constrained decoding improves outputs:
- https://blog.dottxt.ai/performance-gsm8k.html
- https://blog.dottxt.ai/oss-v-gpt4.html
- https://blog.dottxt.ai/say-what-you-mean.html