I've always found it kinda weird that we spend exactly the same amount of compute to calculate both "fork" tokens and "lock" tokens.
I think that with grammar-aware sampling / constrained decoding [0][1] it is sometimes possible to skip calling the model altogether: when the grammar allows only one token, you can just insert it. But I don't think any of the current, widely used model/harness combinations do this, and it would only skip inference in rare edge cases anyway.
I wonder if there is a more general solution that can make models spend more compute on making important choices, while making generation of the "obvious" tokens cheaper and faster.
[0] https://github.com/ggml-org/llama.cpp/blob/master/grammars/R...
[1] https://developers.redhat.com/articles/2025/06/03/structured...
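The fast path I mean looks roughly like this toy sketch. Everything here (the grammar table, the token strings, `fake_model`) is invented for illustration; real implementations like llama.cpp's GBNF grammars use a proper state machine over the tokenizer's vocabulary:

```python
# Toy sketch: only run model inference when the grammar leaves a real choice.

def allowed_next(prefix):
    """Hypothetical grammar: a JSON-ish fragment with forced punctuation."""
    grammar = {
        (): ["{"],                            # forced
        ("{",): ['"name"', '"id"'],           # a real choice: call the model
        ("{", '"name"'): [":"],               # forced: skip the model
        ("{", '"id"'): [":"],                 # forced: skip the model
    }
    return grammar.get(tuple(prefix), ["<end>"])

def fake_model(prefix, candidates):
    """Stand-in for an LLM forward pass; just picks the first candidate."""
    return candidates[0]

def generate():
    out, model_calls = [], 0
    while True:
        cands = allowed_next(out)
        if cands == ["<end>"]:
            break
        if len(cands) == 1:
            tok = cands[0]                    # grammar forces it: no inference
        else:
            tok = fake_model(out, cands)      # genuine choice: run the model
            model_calls += 1
        out.append(tok)
    return out, model_calls
```

Running `generate()` emits three tokens but only pays for one forward pass; the two forced tokens are free.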
> I wonder if there is a more general solution that can make models spend more compute on making important choices, while making generation of the "obvious" tokens cheaper and faster.
I think speculative decoding counts as a (perhaps crude) way of implementing this?
> I wonder if there is a more general solution that can make models spend more compute on making important choices
There's a lot of work going on in various streams towards making it possible to vary compute per-token, dynamically, e.g. universal transformers. Maybe one day it'll work well enough to beat conventional techniques.
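One flavor of this is ACT-style halting, where a shared layer is applied repeatedly and a learned halting score decides when each token is "done". This is a toy numerical sketch, not any particular paper's method; `layer`, `halting_score`, and the threshold are all made up to show the control flow:

```python
# Toy per-token adaptive depth: easy tokens exit early, hard ones loop longer.

import math

def layer(state):
    # Stand-in for one shared transformer layer: nudges the state upward.
    return state + (1.0 - state) * 0.5

def halting_score(state):
    # Hypothetical halting head: confidence grows with the state value.
    return 1.0 / (1.0 + math.exp(-4.0 * (state - 0.5)))

def process_token(initial_state, max_steps=8, threshold=0.85):
    state, steps = initial_state, 0
    while steps < max_steps and halting_score(state) < threshold:
        state = layer(state)                  # spend another unit of compute
        steps += 1
    return state, steps

easy = process_token(0.95)                    # confident immediately
hard = process_token(0.05)                    # needs several refinement steps
```

The training difficulty is that the halting decision is discrete, which is part of why these approaches haven't displaced fixed-depth transformers yet.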
Give coding agents access to intellisense and syntax highlighting.
Making coding agents spit out syntactically correct code token by token is like asking a human to code on a whiteboard.