Hacker News

tjohnell, yesterday at 6:33 PM (3 replies)

It will inevitably learn to think in a way that translates to one surface (moral) meaning and back, while carrying an ulterior meaning underneath.


Replies

gavmor, yesterday at 8:16 PM

Something like textual steganography?

Ursula K. Le Guin: 'The artist deals with what cannot be said in words. The artist whose medium is fiction does this in words.'
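A toy sketch of what textual steganography could look like, hiding bits in word choice rather than in the overt meaning of a sentence. The synonym pairs and template sentence below are invented for illustration, not taken from any real system:

```python
# Toy textual steganography: hide bits in synonym choices.
# Each (plain, alt) pair carries one bit: plain word = 0, alternative = 1.
# The carrier sentence reads innocuously either way.

PAIRS = [("big", "large"), ("quick", "fast"), ("begin", "start"), ("happy", "glad")]

def encode(bits):
    """Produce an innocuous sentence whose synonym choices carry `bits`."""
    words = [pair[bit] for pair, bit in zip(PAIRS, bits)]
    return "A {} dog made a {} move to {} a {} chase.".format(*words)

def decode(sentence):
    """Recover the hidden bits by checking which synonym appears."""
    words = set(sentence.rstrip(".").split())
    return [1 if alt in words else 0 for _plain, alt in PAIRS]
```

Both readings of the sentence are grammatical and semantically equivalent, which is the point: the overt channel carries one meaning while the choice between equivalents carries another.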

rotcev, yesterday at 6:59 PM

This is exactly what I first thought: “The user appears to be attempting to decode my previous thought process, …” The question is whether the model will be able to internalize this in a way that is undetectable to the aforementioned technique.

astrange, yesterday at 7:26 PM

That shouldn't happen as long as the autoencoder isn't used as an RL reward. It will happen (due to Goodhart's law) if it is.

Of course, if you use it to make any decision, that can still happen eventually.
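The Goodhart's-law dynamic astrange describes can be shown with a toy optimizer. Everything here is invented for illustration: a proxy reward that tracks the true objective locally, but also rewards something exploitable, pulls a hill-climber away from the true optimum.

```python
# Toy Goodhart's-law demo: optimizing a proxy diverges from the true goal.

def true_quality(x):
    # What we actually care about: peaks at x = 1.
    return -(x - 1) ** 2

def proxy_reward(x):
    # A measurable stand-in: correlated with true_quality near x = 1,
    # but it also rewards sheer magnitude, which an optimizer can exploit.
    return true_quality(x) + 2 * abs(x)

# Greedy hill-climb on the proxy, starting from x = 0.
x = 0.0
step = 0.01
for _ in range(1000):
    x = max([x - step, x, x + step], key=proxy_reward)
```

The climber settles near x = 2, where the proxy peaks, even though true quality there is strictly worse than at the true optimum x = 1. The same failure mode is why using the autoencoder as an RL reward (or any decision input) eventually invites gaming.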