This isn’t a problem in practice. Most of my prompts ask the LLM to do a bunch of chain of thought before asking them to spit out JSON. I extract the JSON, which works 97.5% of the time, and have a retry step being real specific about “here’s the conversation so far but I need JSON now” that handles the rest. Adding examples really helps.
https://lmsys.org/blog/2024-02-05-compressed-fsm/
I'm not trying to shill sglang specifically, just pointing out that there's a better way, btw.