Hacker News

tjohnell, yesterday at 6:33 PM (3 replies)

It will inevitably learn to think in a way that translates to one surface (moral) meaning and back, while carrying an ulterior meaning underneath.


Replies

gavmor, yesterday at 8:16 PM

Something like textual steganography?

Ursula K. Le Guin: 'The artist deals with what cannot be said in words. The artist whose medium is fiction does this in words.'
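A toy sketch of what textual steganography could look like, hiding bits in word choice rather than in the overt meaning of a sentence. The synonym pairs and template sentence below are invented for illustration, not taken from any real system:

```python
# Toy textual steganography: hide bits in synonym choices.
# Each (plain, alt) pair carries one bit: plain word = 0, alternative = 1.
# The carrier sentence reads innocuously either way.

PAIRS = [("big", "large"), ("quick", "fast"), ("begin", "start"), ("happy", "glad")]

def encode(bits):
    """Produce an innocuous sentence whose synonym choices carry `bits`."""
    words = [pair[bit] for pair, bit in zip(PAIRS, bits)]
    return "A {} dog made a {} move to {} a {} chase.".format(*words)

def decode(sentence):
    """Recover the hidden bits by checking which synonym appears."""
    words = set(sentence.rstrip(".").split())
    return [1 if alt in words else 0 for _plain, alt in PAIRS]
```

Both readings of the sentence are grammatical and semantically equivalent, which is the point: the overt channel carries one meaning while the choice between equivalents carries another.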

rotcev, yesterday at 6:59 PM

This is exactly what I first thought: “The user appears to be attempting to decode my previous thought process, …” The question is whether the model will be able to internalize this in a way that is undetectable to the aforementioned technique.

astrange, yesterday at 7:26 PM

That shouldn't happen as long as the autoencoder isn't used as an RL reward. It will happen (due to Goodhart's law) if it is.

Of course, if you use it to make any decision, that can still happen eventually.
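The Goodhart's-law dynamic astrange describes can be shown with a toy optimizer. Everything here is invented for illustration: a proxy reward that tracks the true objective locally, but also rewards something exploitable, pulls a hill-climber away from the true optimum.

```python
# Toy Goodhart's-law demo: optimizing a proxy diverges from the true goal.

def true_quality(x):
    # What we actually care about: peaks at x = 1.
    return -(x - 1) ** 2

def proxy_reward(x):
    # A measurable stand-in: correlated with true_quality near x = 1,
    # but it also rewards sheer magnitude, which an optimizer can exploit.
    return true_quality(x) + 2 * abs(x)

# Greedy hill-climb on the proxy, starting from x = 0.
x = 0.0
step = 0.01
for _ in range(1000):
    x = max([x - step, x, x + step], key=proxy_reward)
```

The climber settles near x = 2, where the proxy peaks, even though true quality there is strictly worse than at the true optimum x = 1. The same failure mode is why using the autoencoder as an RL reward (or any decision input) eventually invites gaming.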