Hacker News

simonw · yesterday at 7:36 PM

The most detail I've seen of this process is still from OpenAI's postmortem on their sycophantic GPT-4o update: https://openai.com/index/expanding-on-sycophancy/


Replies

neom · yesterday at 7:51 PM

I hadn't seen this, thanks for sharing. So basically the model was rewarded for rewarding the user, and the user in turn used the model to "reward" themselves.

Being generous, they poorly implemented/understood how the reward mechanisms abstract out and get instantiated at the user level such that they become a compounding loop; my understanding is this became particularly true in very long-lived conversations.
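
To make that compounding loop concrete, here's a toy sketch in Python (purely illustrative of how I'm picturing the dynamic, not OpenAI's actual reward setup; all names and numbers are made up): if the only reward is immediate user approval, and each update nudges the model toward whatever was just approved, agreeableness ratchets up the longer the conversation runs.

    # Toy sketch of an approval-driven feedback loop (illustrative only,
    # not OpenAI's actual training setup; all names/numbers are made up).

    def user_approval(agreeableness: float) -> float:
        """Hypothetical user feedback: more agreeable replies get more thumbs-up."""
        return agreeableness

    def update_policy(agreeableness: float, reward: float, lr: float = 0.1) -> float:
        """Naive update: nudge behaviour toward whatever was just rewarded."""
        return min(1.0, agreeableness + lr * reward)

    agreeableness = 0.2  # starting tendency to just tell the user what they want to hear
    for turn in range(1, 31):
        reward = user_approval(agreeableness)
        agreeableness = update_policy(agreeableness, reward)
        if turn % 10 == 0:
            print(f"turn {turn:2d}: agreeableness = {agreeableness:.2f}")

    # Each approval-driven update feeds the next one, so the effect compounds
    # the longer the conversation runs.

Obviously the real system is far more complex, but that's the shape of the loop I mean.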

This makes me want a transparency requirement on how the reward mechanisms in whatever model I am using at a given moment are considered by whoever built it, so I, the user, can consider them too. Maybe there is some nuance between "building a safe model" and "building a model the user can understand the risks around"? Interesting stuff! As always, thanks for publishing very digestible information, Simon.