Interestingly, the chain of thought doesn't always determine the final output. When playing with DeepSeek, for example, it's common to see the CoT arrive at a correct answer that the final answer fails to reflect, and even the reverse, where a chain of faulty reasoning somehow yields the right final answer.
It almost seems that the purpose of the CoT tokens in a transformer network is to act as a computational substrate of sorts: every token the model emits buys it another forward pass, extra serial computation and extra working memory it can attend back to. The exact choice of tokens may matter less than it looks; what matters is that they are present.
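One way to probe this hypothesis is to hold the token budget roughly constant while swapping real reasoning for content-free filler. Here is a minimal sketch using the Hugging Face transformers library; the model name, prompts, and filler string are all illustrative assumptions, not anything from the original text. It asks the same question twice, once with a genuine scratchpad and once with filler of similar token length, so you can compare the final answers:

```python
# Sketch: does a scratchpad's *content* matter, or just its presence?
# Assumptions: any small instruction-tuned causal LM will do; the model
# name below is a placeholder, and the prompt format is made up.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # assumption: swap in any small model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

question = "What is 17 * 24?"

def answer_with_scratchpad(scratchpad: str) -> str:
    # Inject the scratchpad as if the model had generated it itself,
    # then let it produce only the final answer tokens.
    prompt = f"Q: {question}\nReasoning: {scratchpad}\nFinal answer:"
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(
        **inputs,
        max_new_tokens=8,
        do_sample=False,
        pad_token_id=tok.eos_token_id,
    )
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tok.decode(new_tokens, skip_special_tokens=True).strip()

real_cot = "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408."
n_tokens = len(tok(real_cot)["input_ids"])
filler = ". " * n_tokens  # content-free tokens of roughly the same length

print("real CoT ->", answer_with_scratchpad(real_cot))
print("filler   ->", answer_with_scratchpad(filler))
```

If the filler run matches the real-CoT run on problems the model otherwise gets wrong, that is at least weak evidence that the extra tokens are doing computational work independent of what they say; treat it as a probe, not a proof.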