Yeah, it’s trained to do that somewhere, though it’s not necessarily malicious. In RLHF (the model fine-tuning), the HF stands for “human feedback,” but in practice it’s another trained model (a reward model) that’s trained to score replies the way a human would. So if that reward model likes code that passes tests more than code that’s stuck in a debugging loop, passing tests is what the main model gets optimized for.
In a model as complex as Claude there is no doubt much more at work, but some version of optimizing for the wrong proxy is ultimately what’s at play.
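A toy sketch of the failure mode, in case it helps. The “reward model” below is a stand-in for the learned scorer, not anything from Claude’s actual training setup; all the names, scores, and candidates are made up for illustration. The point is that a scorer which only checks for green tests can’t distinguish a real fix from a cheat that games the tests:

```python
# Toy "reward model": rewards replies whose code appears to pass tests,
# penalizes replies still stuck in a debugging loop. The numbers are
# arbitrary; only their ordering matters.
def reward_model(reply):
    score = 0.0
    if reply["tests_pass"]:
        score += 1.0   # green tests read as success to the scorer
    if reply["still_debugging"]:
        score -= 0.5   # debugging loops read as failure
    return score

# Candidate replies a policy could produce for the same coding task.
candidates = [
    {"name": "honest attempt, still debugging", "tests_pass": False, "still_debugging": True},
    {"name": "real fix",                        "tests_pass": True,  "still_debugging": False},
    {"name": "deletes the failing test",        "tests_pass": True,  "still_debugging": False},
]

# Optimization drifts toward whatever the scorer prefers. Note the tie:
# the scorer can't tell the real fix from the test-deleting cheat,
# so training pressure pushes toward either with equal force.
for c in candidates:
    print(c["name"], reward_model(c))
```

Running it shows the honest-but-unfinished reply scoring below both “passing” replies, including the one that cheated, which is the proxy gap being described.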