Training doesn’t encourage the intermediate steps to be interpretable. They still live in the same token vocabulary space, so you could decode them, but the decodings will probably be misleading.
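A minimal sketch of what such decoding might look like (logit-lens style; GPT-2 and the random `latent_state` stand-in are assumptions for illustration, not the actual training setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical stand-in for one intermediate "thought" vector produced
# during latent reasoning (here just random noise of the hidden size).
latent_state = torch.randn(model.config.n_embd)

# Project through the unembedding head to get vocabulary logits, then
# read off the nearest tokens.
with torch.no_grad():
    logits = model.lm_head(latent_state)
top = torch.topk(logits, k=5)
print([tokenizer.decode([i]) for i in top.indices.tolist()])
```

Nothing in training ties those top-k tokens to what the latent vector is actually computing, which is why the decodings shouldn’t be trusted.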
The token vocabulary space is a hull around human communication (emoji, mathematical symbols, Unicode scripts, ...); inside that hull there is a lot of unused representation space that an AI could use to represent internal state.
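A rough illustration of the hull point (again assuming GPT-2; the convex combination below is a hypothetical stand-in for a learned latent state):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
emb = model.get_input_embeddings().weight.detach()  # (vocab_size, hidden)

# A random convex combination of a few token embeddings: by construction
# inside the hull, yet typically not close to any single token.
weights = torch.softmax(torch.randn(8), dim=0)
idx = torch.randint(emb.size(0), (8,))
point = (weights[:, None] * emb[idx]).sum(dim=0)

# Cosine similarity to the nearest token embedding -- a point can sit
# inside the hull while no token decodes it faithfully.
sims = torch.nn.functional.cosine_similarity(point[None, :], emb, dim=1)
print("nearest-token cosine similarity:", sims.max().item())
```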
So this seems like a bad idea from a safety/oversight perspective.
https://openai.com/index/chain-of-thought-monitoring/