BTW, the structured outputs debate is significantly more complicated than even your own post implies.
You aren't testing structured outputs + a model in isolation; you're testing:
1. The structured-outputs backend used. There are at least four major open-source ones: Outlines, XGrammar, lm-format-enforcer, and Guidance, and OpenAI, Anthropic, Google, and xAI (Grok) each have their own. They all do things SIGNIFICANTLY differently. That's at least eight different backends to compare (see the first sketch after this list).
2. The settings used for each structured-outputs backend. Oh, you didn't know there are often 5+ settings for subtle stuff like whitespace handling? Better learn what those settings do and how to tweak them (the whitespace knob in the sketch below is one example).
3. The model's underlying sampling settings, i.e. any default temperature, top_p/top_k, etc. in play. Remember that the ORDER the samplers are applied in matters here! Hugging Face transformers and vLLM have opposite defaults for whether temperature is applied before the other samplers or after (the second sketch below shows why that changes the outcome).
4. The model itself, and don't forget differences between quantizations and variants of the same model!
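
To make points 1 and 2 concrete, here's roughly what those knobs look like in vLLM's offline API. Treat this as a sketch, not a recipe: arguments like `guided_decoding_backend` and `whitespace_pattern` exist in recent vLLM releases but move around between versions, and the model name and schema here are just placeholders.

```python
# Rough sketch of points 1-2 using vLLM's offline API (names are
# version-dependent; model and schema are placeholders).
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# Point 1: the backend is a separate choice from the model. Swapping
# "xgrammar" for "outlines" or "lm-format-enforcer" can change the output.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", guided_decoding_backend="xgrammar")

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

params = SamplingParams(
    # Point 3: these interact with the grammar mask, and their defaults
    # (and order of application) differ across serving stacks.
    temperature=0.7,
    top_p=0.95,
    guided_decoding=GuidedDecodingParams(
        json=schema,
        # Point 2: backend-specific knobs. whitespace_pattern is only honored
        # by some backends (e.g. Outlines) and silently ignored by others.
        whitespace_pattern=r" ?",
    ),
)

print(llm.generate(["Extract: Alice is 31."], params)[0].outputs[0].text)
```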
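
And to make point 3 concrete, here's a tiny framework-free NumPy sketch of why sampler order matters: applying temperature before vs. after top-p truncation can keep a different candidate set, so two stacks with "the same" settings don't even have the same tokens reachable. The numbers are made up purely for illustration.

```python
# Minimal sketch: temperature applied before vs. after top-p truncation
# keeps different candidate sets. Illustrative numbers only.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def top_p_kept(probs, p):
    """Return the set of indices kept by nucleus (top-p) filtering."""
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    return set(order[: np.searchsorted(cum, p) + 1].tolist())

logits = np.array([2.0, 1.5, 1.0, 0.2, -1.0])
temp, p = 2.0, 0.9

# Order A: temperature first, then top-p on the flattened distribution.
kept_a = top_p_kept(softmax(logits / temp), p)

# Order B: top-p on the raw distribution, temperature only afterwards.
kept_b = top_p_kept(softmax(logits), p)

print(kept_a, kept_b)  # {0, 1, 2, 3} vs {0, 1, 2}: different tokens reachable
```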
Almost no one who does these kinds of analyses even talks about these additional factors, academics included.
Sometimes it feels like I'm the only one in this world who actually uses this feature at the extremes of its capabilities.