Hacker News

nickpsecurity, last Friday at 8:49 PM

They mostly imitate patterns in the training material. They do it in response to what gets the reward up for RL training. There are probably lots of examples of both lying and confessions in the training data. So, it should surprise nobody that next-sentence machines fill in a lie or confession in situations similar to the training data.

I don't consider that very intelligent or more emergent than other behaviors. Now, if nothing like that was in the training data (pure honesty with no confessions), it would be very interesting if it replied with lies and confessions, because it wouldn't have been pretrained to lie or confess like the above model likely was.