"Sticks to whatever alignment is imparted" assumes what gets imparted is alignment rather ...

ctoth • yesterday at 4:52 PM

"Sticks to whatever alignment is imparted" assumes what gets imparted is alignment rather than alignment-performance on the training distribution.

A coherent world model could make a system more consistently aligned. It could also make it more consistently aligned-seeming. Coherence is a multiplier, not a direction.
