Hacker News

andy99, today at 4:41 PM

I take the point to be that if an LLM bases its output on a coherent world model, this jointly improves its general capabilities, like usefully resolving ambiguity, and its ability to stick to whatever alignment is imparted as part of that world model.


Replies

ctoth, today at 4:52 PM

"Sticks to whatever alignment is imparted" assumes what gets imparted is alignment rather than alignment-performance on the training distribution.

A coherent world model could make a system more consistently aligned. It could also make it more consistently aligned-seeming. Coherence is a multiplier, not a direction.