Hacker News

jumploops, yesterday at 9:30 PM

My experience here is anecdotal, but I've had more success having the model solve the task first and then return the result as JSON in a separate LLM call[0].

Running a single non-reasoning LLM call from source data (text/image/audio in your diagram) to structured JSON seems fragile with the current state of LLMs.

You're essentially asking the model to do two tasks in one pass: parse the input and then format the output. It's amazing that it works as often as it does, but it's reasonable to assume it won't work every time.
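A rough sketch of the two-call pattern, to make the split concrete. Everything named here is an assumption for illustration: `llm` is a hypothetical callable that takes a prompt and returns text, `solve_then_format` and the `{"result": ...}` shape are made up, and `fake_llm` is an offline stand-in rather than a real model or any particular vendor's API.

```python
import json
from typing import Callable

# Hypothetical type: any function that takes a prompt and returns model text.
LLM = Callable[[str], str]


def solve_then_format(source: str, llm: LLM, max_retries: int = 2) -> dict:
    """Two-pass pattern: free-form answer first, JSON formatting second."""
    # Pass 1: let the model work on the task with no format constraints.
    answer = llm(f"Solve the following task. Answer in plain text.\n\n{source}")

    # Pass 2: a separate call whose only job is to wrap the answer in JSON.
    format_prompt = (
        'Return only a JSON object of the form {"result": "<answer>"} '
        "containing this answer verbatim:\n\n" + answer
    )
    for _ in range(max_retries + 1):
        raw = llm(format_prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # formatting failed; retry only the formatting pass
        if isinstance(parsed, dict) and "result" in parsed:
            return parsed
    raise ValueError("model did not return valid JSON")


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs offline."""
    if prompt.startswith("Return only a JSON object"):
        # Echo back whatever answer follows the blank line, as JSON.
        answer = prompt.split("\n\n", 1)[1]
        return json.dumps({"result": answer})
    return "print('hello')"
```

Calling `solve_then_format("write a hello-world script", fake_llm)` runs both passes against the stub. With a real model, the point of the split is that a formatting failure only needs the second call retried, while the first call's answer is left untouched.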

(As a human, when I'm filling out a complex form, I'll often jump around the document.)

I'm curious how the benchmarks change when you add an intermediate representation, either via reasoning or an additional LLM call. I'd also love to see a comparison with BAML[1].

[0] We were using structured outputs as part of an agentic state machine, where the JSON contained code snippets (html/js/py/etc.). When we first prompted the model for the code and then wrapped it in JSON, we saw much higher quality and success rates than when we asked for JSON straightaway.

[1] https://boundaryml.com/