
imtringued • 12/09/2024

https://lmsys.org/blog/2024-02-05-compressed-fsm/

I'm not trying to shill sglang specifically, just pointing out that there's a better way, btw.

Replies

hansvm • 12/09/2024

...with the obvious caveat that the distribution of responses isn't the same

Elaborating slightly, retrying until the whole output adheres to the schema samples from a different distribution than greedily restricting each token to ones that adhere to the schema.

The simplest toy example I can come up with for that property is a universe of answers "aa", "ab", "bc", all of which the model is equally likely to output for a given prompt under normal auto-regressive sampling. The schema, as a regex, is ".[bc]". Retry-till-success produces "ab" 1/2 of the time and "bc" the other half, since "aa" is simply thrown away. Greedily adhering to the schema produces "ab" 2/3 of the time and "bc" the remaining third: the first character is "a" with probability 2/3, and once you're there the mask forces "b".
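
If you want to check the arithmetic, here's a minimal sketch of that toy in Python. Everything in it is made up for illustration (the `MODEL` table, the function names, the three-letter alphabet), and it is not how sglang or any real constrained decoder is implemented; it only enumerates the toy universe exactly with fractions.

```python
import re
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Hypothetical toy "model": three complete outputs, equally likely.
MODEL = {"aa": Fraction(1, 3), "ab": Fraction(1, 3), "bc": Fraction(1, 3)}
SCHEMA = re.compile(r".[bc]")


def retry_till_success(model, schema):
    """Rejection sampling: condition on the whole string matching the schema."""
    valid = {s: p for s, p in model.items() if schema.fullmatch(s)}
    total = sum(valid.values())
    return {s: p / total for s, p in valid.items()}


def constrained_decoding(model, schema, alphabet="abc", length=2):
    """Per-step masking: drop characters the schema forbids, then renormalise."""
    # Every fixed-length string over the toy alphabet that the schema accepts.
    accepted = {
        "".join(chars)
        for chars in product(alphabet, repeat=length)
        if schema.fullmatch("".join(chars))
    }
    result = defaultdict(Fraction)

    def step(prefix, prob):
        if len(prefix) == length:
            result[prefix] += prob
            return
        # The model's next-character weights given the prefix so far.
        weights = defaultdict(Fraction)
        for s, p in model.items():
            if s.startswith(prefix):
                weights[s[len(prefix)]] += p
        # Keep only characters that can still lead to a schema-accepted string.
        allowed = {
            c: w
            for c, w in weights.items()
            if any(v.startswith(prefix + c) for v in accepted)
        }
        total = sum(allowed.values())
        if not total:
            return  # dead end; does not occur in this toy
        for c, w in allowed.items():
            step(prefix + c, prob * w / total)

    step("", Fraction(1))
    return dict(result)


print(retry_till_success(MODEL, SCHEMA))    # "ab": 1/2, "bc": 1/2
print(constrained_decoding(MODEL, SCHEMA))  # "ab": 2/3, "bc": 1/3
```

The difference comes entirely from where the renormalisation happens: once over whole strings in the first function, once per character in the second.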

Last I checked large-scale LLMs, it was a problem in the wild for large string fields. They tend to want to finish the string with ellipses (thus creating an incorrect response), but when they made that mistake they'd tend to truncate the entire JSON record and generate something that doesn't adhere to the schema. Retry-till-success therefore has a high rate of successful parses, since the bad responses fail validation and get retried. Greedily adhering to the schema converts those ellipsis errors into syntactically correct garbage.

Other such bugs can be much harder to quantify (model explainability is hard), but I'd be cautious employing the technique without a lot of case studies for your particular problem domain.