Hacker News

goodmythical today at 4:04 PM (2 replies)

Isn't this similar to models that have "double check the answer"?

First pass runs your input through, second pass runs its output as input?

Except that a double check presumably runs the entire stack, while you're trying to skip the translation steps and only double-check the logic?


Replies

sva_ today at 4:30 PM

I don't think it's mathematically equivalent or even close, because the context/logprobs will be very different, since you only produce 1 token per pass. I'd say the token itself carries a lot less information than the signal propagating through the residual stream of the transformer blocks.

dnhkng today at 4:11 PM

Maybe, but the interesting thing for me is that this only works with specific 'chunks' of the transformer layer stack. Repeating more or less than the optimal chunk leads to worse performance.
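
For anyone following along, here is a rough sketch of the difference being discussed in this thread. It is only a toy: the "layers" are simple per-element residual updates, not real transformer blocks, and every name in it (`make_layer`, `repeat_chunk`, `token_round_trip`, `embed`) is made up for illustration rather than taken from any actual implementation. It contrasts re-running a chunk of layers on the hidden state with collapsing the output to a single token and feeding that back in.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_layer(scale: float, shift: float) -> Layer:
    """Toy stand-in for a block: residual update h + (scale * h + shift)."""
    def layer(h: Vector) -> Vector:
        return [x + scale * x + shift for x in h]
    return layer


def run_layers(h: Vector, layers: Sequence[Layer]) -> Vector:
    """Pass the hidden state through the given layers in order."""
    for layer in layers:
        h = layer(h)
    return h


def repeat_chunk(h: Vector, layers: Sequence[Layer],
                 start: int, end: int, repeats: int) -> Vector:
    """Run layers[:start], then layers[start:end] `repeats` times, then layers[end:].

    The full hidden state is carried between repeats; nothing is decoded.
    """
    h = run_layers(h, layers[:start])
    for _ in range(repeats):
        h = run_layers(h, layers[start:end])
    return run_layers(h, layers[end:])


def embed(token: int, dim: int) -> Vector:
    """Hypothetical embedding: a one-hot vector for the token id."""
    return [1.0 if i == token else 0.0 for i in range(dim)]


def token_round_trip(h: Vector, layers: Sequence[Layer]) -> Vector:
    """'Double check' style: full pass, collapse to one token, full pass again."""
    out = run_layers(h, layers)
    token = max(range(len(out)), key=lambda i: out[i])  # argmax keeps one integer
    return run_layers(embed(token, len(h)), layers)


# Four toy layers; the middle two form the chunk that gets repeated.
layers = [make_layer(0.1, 0.0), make_layer(0.2, 0.05),
          make_layer(0.3, -0.05), make_layer(0.1, 0.0)]
h_a = [0.1, 0.9, 0.2]
h_b = [0.3, 0.5, 0.1]
```

In this toy setup, `h_a` and `h_b` share the same argmax, so `token_round_trip` returns the same result for both, while `repeat_chunk` still tells them apart. That is the bottleneck sva_ describes: one token is a much narrower channel than the residual stream. The `start`, `end`, and `repeats` arguments correspond to the choice of chunk dnhkng mentions; the sketch says nothing about which chunk would actually help in a real model.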