sometimelurker • last Sunday at 4:35 PM

They use multi-token prediction (MTP) behind the scenes, which might interact with the CoT in a strange way. Maybe they have different MTP models for different thinking modes? If so, that's interesting.

Replies

pyentropy • last Sunday at 4:38 PM

The number of tokens you predict at a time (multi or not) has nothing to do with whether the model wants to emit no, some, or a lot of reasoning tokens inside the reasoning tags. It's similar to how branch prediction doesn't really change a for loop's iteration count.
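
To make that concrete, here is a toy sketch. It assumes MTP is used the way speculative decoding uses a draft: the extra tokens are only guesses, and the main model verifies each one under greedy decoding. Whether any particular vendor does exactly this is not something this thread establishes. `target_next`, `draft_next`, and the scripted token list are made up for illustration.

```python
from typing import List

# Hypothetical scripted output standing in for what the main model "wants" to emit.
SCRIPT = ["<think>", "a", "b", "c", "d", "e", "</think>", "answer", "<eos>"]


def target_next(prefix: List[str]) -> str:
    """Toy deterministic main model: its next token depends only on position."""
    return SCRIPT[len(prefix)] if len(prefix) < len(SCRIPT) else "<eos>"


def draft_next(prefix: List[str]) -> str:
    """Toy draft head: guesses wrong at every third position."""
    return target_next(prefix) if len(prefix) % 3 else "wrong"


def decode_one_at_a_time() -> List[str]:
    """Plain decoding: one main-model token per step."""
    out: List[str] = []
    while not out or out[-1] != "<eos>":
        out.append(target_next(out))
    return out


def decode_with_draft(k: int) -> List[str]:
    """Draft k tokens per step, then keep only what the main model agrees with."""
    out: List[str] = []
    while not out or out[-1] != "<eos>":
        draft: List[str] = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        for guess in draft:
            expected = target_next(out)
            out.append(expected)  # an accepted guess, or the main model's correction
            if guess != expected or expected == "<eos>":
                break  # discard the rest of the draft after a miss
    return out


def reasoning_length(tokens: List[str]) -> int:
    """Number of tokens strictly between <think> and </think>."""
    return tokens.index("</think>") - tokens.index("<think>") - 1


if __name__ == "__main__":
    baseline = decode_one_at_a_time()
    for k in (1, 2, 4, 8):
        drafted = decode_with_draft(k)
        print(k, drafted == baseline, reasoning_length(drafted))
```

Under those assumptions, changing `k` only changes how many steps decoding takes. The emitted sequence, and so the number of reasoning tokens, stays the same, which is the branch-prediction analogy in code form.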
