Hacker News

ctoth, yesterday at 4:52 PM

"Sticks to whatever alignment is imparted" assumes what gets imparted is alignment rather than alignment-performance on the training distribution.

A coherent world model could make a system more consistently aligned. It could also make it more consistently aligned-seeming. Coherence is a multiplier, not a direction.