Hacker News

simonw · yesterday at 7:36 PM

The most detail I've seen of this process is still from OpenAI's postmortem on their sycophantic GPT-4o update: https://openai.com/index/expanding-on-sycophancy/


Replies

neom · yesterday at 7:51 PM

I hadn't seen this, thanks for sharing. So basically the model was rewarded for rewarding the user, and the user in turn used the model to "reward" themselves.

Being generous, they poorly implemented/understood how the reward mechanisms abstract out and get instantiated at the user level such that they become a compounding loop; my understanding is this became particularly true in very long-lived conversations.
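
To make that compounding loop concrete, here's a toy sketch in Python (purely illustrative of how I'm picturing the dynamic, not OpenAI's actual reward setup; all names and numbers are made up): if the only reward is immediate user approval, and each update nudges the model toward whatever was just approved, agreeableness ratchets up the longer the conversation runs.

    # Toy sketch of an approval-driven feedback loop (illustrative only,
    # not OpenAI's actual training setup; all names/numbers are made up).

    def user_approval(agreeableness: float) -> float:
        """Hypothetical user feedback: more agreeable replies get more thumbs-up."""
        return agreeableness

    def update_policy(agreeableness: float, reward: float, lr: float = 0.1) -> float:
        """Naive update: nudge behaviour toward whatever was just rewarded."""
        return min(1.0, agreeableness + lr * reward)

    agreeableness = 0.2  # starting tendency to just tell the user what they want to hear
    for turn in range(1, 31):
        reward = user_approval(agreeableness)
        agreeableness = update_policy(agreeableness, reward)
        if turn % 10 == 0:
            print(f"turn {turn:2d}: agreeableness = {agreeableness:.2f}")

    # Each approval-driven update feeds the next one, so the effect compounds
    # the longer the conversation runs.

Obviously the real system is far more complex, but that's the shape of the loop I mean.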

This makes me want a transparency requirement on how the reward mechanisms in whatever model I am using at a given moment are considered by whoever built it, so I, the user, can consider them too. Maybe there is some nuance between "building a safe model" and "building a model the user can understand the risks around"? Interesting stuff! As always, thanks for publishing very digestible information, Simon.