Training doesn’t encourage the intermediate steps to be interpretable. They still live in the same token vocabulary space, so you could decode them, but the decodings will probably be misleading.
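A minimal sketch of what such decoding might look like (logit-lens style; GPT-2 and the random `latent_state` stand-in are assumptions for illustration, not the actual training setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical stand-in for one intermediate "thought" vector produced
# during latent reasoning (here just random noise of the hidden size).
latent_state = torch.randn(model.config.n_embd)

# Project through the unembedding head to get vocabulary logits, then
# read off the nearest tokens.
with torch.no_grad():
    logits = model.lm_head(latent_state)
top = torch.topk(logits, k=5)
print([tokenizer.decode([i]) for i in top.indices.tolist()])
```

Nothing in training ties those top-k tokens to what the latent vector is actually computing, which is why the decodings shouldn’t be trusted.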
The token vocabulary space is a hull around human communication (emoji, mathematical symbols, Unicode scripts, ...); inside that hull there is a lot of unused representation space that an AI could use to represent internal state.
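A rough illustration of the hull point (again assuming GPT-2; the convex combination below is a hypothetical stand-in for a learned latent state):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
emb = model.get_input_embeddings().weight.detach()  # (vocab_size, hidden)

# A random convex combination of a few token embeddings: by construction
# inside the hull, yet typically not close to any single token.
weights = torch.softmax(torch.randn(8), dim=0)
idx = torch.randint(emb.size(0), (8,))
point = (weights[:, None] * emb[idx]).sum(dim=0)

# Cosine similarity to the nearest token embedding -- a point can sit
# inside the hull while no token decodes it faithfully.
sims = torch.nn.functional.cosine_similarity(point[None, :], emb, dim=1)
print("nearest-token cosine similarity:", sims.max().item())
```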
So this seems like a bad idea from a safety/oversight perspective.
https://openai.com/index/chain-of-thought-monitoring/