While I agree that you must be careful when using structured outputs, the article doesn't provide good arguments:
1. In the examples provided, the author compares freeform CoT + JSON output vs. non-CoT structured output. This is unfair and biases the results towards what they wanted to show. These days, you don't need to include a "reasoning" field in the schema as mentioned in the article; you can just use thinking tokens (e.g., reasoning_effort for OpenAI models). You get the best of both worlds: freeform reasoning and structured output. I tested this, and the results were very similar for both.
2. Let Me Speak Freely? had several methodological issues. I address some of them (and .txt's rebuttal) here: https://dylancastillo.co/posts/say-what-you-mean-sometimes.h...
3. There's no silver bullet. Structured outputs might improve or worsen your results depending on the use case. What you really need to do is run your evals and make a decision based on the data.
BTW, the structured outputs debate is significantly more complicated than even your own post implies.
You aren't testing structured outputs+model alone, you are testing
1. The structured outputs backend used. There are at least 3 major free ones, outlines, xgrammer, lm-format-enforcer and guidance. OpenAI, Anthropic, Google, and Grok will all have different ones. They all do things SIGNIFICANTLY differently. That's at least 8 different backends to compare.
2. The settings used for each structured output backend. Oh, you didn't know that there's often 5+ settings related to how they handle subtle stuff like whitespaces? Better learn to figure out what these settings do and how to tweak them!
3. The models underlying sampling settings, i.e. any default temperature, top_p/top_k, etc going on. Remember that the ORDER of application of samplers matters here! Huggingface transformers and vLLM have opposite defaults on if temperature happens before sampling or after!
4. The model, and don't forget about differences around quants/variants of the model!
Almost no one who does any kinds of these analysis even talk about these additional factors, including academics.
Sometimes it feels like I'm the only one in this world who actually uses this feature at the extremes of its capabilities.