I'm not 100% convinced by this post. I'd like to see a more extensive formal eval demonstrating that structured outputs from different providers reduce the quality of data extraction results.
Assuming this holds up, I wonder if a good workaround for this problem - the problem that turning on structured outputs makes errors more likely - would be to do this:
1. Prompt the LLM "extract numbers from this receipt, return data in this JSON format: ..." - without using the structured output mechanism.
2. If the returned JSON does indeed fit the schema then great, you're finished! But if it doesn't...
3. Round-trip the response from the previous call through the LLM again, this time with structured outputs configured. This should give you back the higher quality extracted data in the exact format you want.
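A rough sketch of those three steps, for illustration only. The `free_form` and `structured` callables are hypothetical stand-ins for whatever LLM client you use (one call without structured outputs, one with), and `REQUIRED_FIELDS` is a made-up receipt schema, not something from the post.

```python
import json
from typing import Callable

# Hypothetical receipt schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"total": (int, float), "currency": str}


def fits_schema(text: str) -> bool:
    """Return True if text is a JSON object with the expected fields and types."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(
        key in data
        and isinstance(data[key], types)
        and not isinstance(data[key], bool)  # bool is a subclass of int in Python
        for key, types in REQUIRED_FIELDS.items()
    )


def extract(
    receipt_text: str,
    free_form: Callable[[str], str],
    structured: Callable[[str], str],
) -> dict:
    """Step 1: plain prompt. Step 2: check the schema. Step 3: structured fallback."""
    first = free_form(receipt_text)
    if fits_schema(first):
        return json.loads(first)
    # Round-trip the first response (not the original input) through the
    # structured-output call, as described in step 3.
    return json.loads(structured(first))
```

The structured call only runs when the first response fails validation, so the common case costs one LLM call.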
(one of the creators of BAML here) yep! exactly!
That workaround we've found works quite well, but the problem is that it's not sufficient to just retry in the case of failed schema matches (it's both inefficient and also imo incorrect).
Take these two scenarios for example:
Scenario 1. My system is designed to output receipts, but the user does something malicious and gives me an invoice. During step 2, it fails to fit the schema, but then you try step 3, and now you have a receipt! It's close, but your business logic is not expecting that. When schema alignment fails, it's usually because the schema was ambiguous or the input was not valid.
Scenario 2. I ask the LLM to produce this schema:
class Person {
  name string
  past_jobs string[]
}
However, the person has only ever worked at 1 job, so the LLM outputs: { "name": "Vaibhav", "past_jobs": "Google" }. Technically, since you know you expect an array, you could just transform the string -> string[]. That's the algorithm we created: schema-aligned parsing. More here if you're interested: https://boundaryml.com/blog/schema-aligned-parsing
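To make the string -> string[] idea concrete, here is a toy Python sketch of that single coercion. It is not BAML's actual implementation (the linked post describes the real algorithm); `coerce_person` is a hypothetical helper that only handles this one case.

```python
import json


def coerce_person(raw: str) -> dict:
    """Parse a Person and wrap a lone string in a list where a list is expected."""
    data = json.loads(raw)
    jobs = data.get("past_jobs", [])
    if isinstance(jobs, str):
        # The schema says string[], so a bare string is treated as a one-item list.
        jobs = [jobs]
    return {"name": data["name"], "past_jobs": list(jobs)}


print(coerce_person('{ "name": "Vaibhav", "past_jobs": "Google" }'))
```

The coercion is driven by the expected type, so no second LLM call is needed to fix the output.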
Benchmark-wise, when we last tested, it seemed to help on top of every model (especially the smaller ones): https://www.reddit.com/r/LocalLLaMA/comments/1esd9xc/beating...
Hope this helps with some of the ambiguities in the post :)
Isn't it better to put it in an agent loop, with the structured output JSON just specified as a tool? The function call can then just return a summary of the parsed input. We can add a validation step to the system prompt asking the LLM to verify that it has provided the inputs correctly. This will allow the LLM itself to self-reflect and correct if needed.
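Something like this, as a very rough sketch. `save_receipt` is a hypothetical tool body, and `next_call` is a hypothetical stand-in for one model turn: it receives the previous tool result and returns the next tool arguments, or None once the model stops calling the tool. The field names are made up for the example.

```python
from typing import Callable, Optional


def save_receipt(args: dict) -> str:
    """Tool body: validate the arguments and return a summary the model can check."""
    total = args.get("total")
    if isinstance(total, bool) or not isinstance(total, (int, float)):
        return "error: 'total' must be a number"
    items = args.get("items")
    if not isinstance(items, list):
        return "error: 'items' must be a list"
    return f"parsed {len(items)} item(s), total={total}"


def run_loop(
    next_call: Callable[[Optional[str]], Optional[dict]],
    max_turns: int = 3,
) -> Optional[str]:
    """Feed each tool result back to the model until it stops calling the tool."""
    feedback: Optional[str] = None
    last_ok: Optional[str] = None
    for _ in range(max_turns):
        args = next_call(feedback)
        if args is None:  # the model is satisfied and makes no further tool call
            break
        feedback = save_receipt(args)
        if not feedback.startswith("error"):
            last_ok = feedback
    return last_ok
```

The summary string is what gives the model something to reflect on: an error message prompts a corrected call, and a success summary lets it confirm the parsed values match the input.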