Hacker News

Translationaut yesterday at 7:42 PM

There is an ethical reasoning dataset intended to teach models stable and predictable values: https://huggingface.co/datasets/Bachstelze/ethical_coconot_6... An Olmo-3-7B-Think model has been adapted with it. In theory, this should yield better alignment, but the empirical evaluation is still a work in progress.
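
A minimal sketch of what such an adaptation could look like with the Hugging Face stack. The dataset id is truncated in the link above, and the model Hub id, column names, and hyperparameters here are assumptions, not the recipe actually used:

    # Hypothetical supervised fine-tuning of Olmo-3-7B-Think on the
    # ethical-reasoning dataset mentioned above.
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              Trainer, TrainingArguments,
                              DataCollatorForLanguageModeling)

    DATASET_ID = "Bachstelze/ethical_coconot_..."  # full id truncated in the post
    MODEL_ID = "allenai/Olmo-3-7B-Think"           # assumed Hub id for the base model

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    dataset = load_dataset(DATASET_ID, split="train")

    def tokenize(example):
        # Assumes a single free-text column; the real dataset may instead pair
        # prompts with reasoned answers or refusals.
        return tokenizer(example["text"], truncation=True, max_length=2048)

    tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="olmo3-ethical-sft",
            per_device_train_batch_size=1,
            gradient_accumulation_steps=16,
            num_train_epochs=1,
            learning_rate=2e-5,
            bf16=True,
        ),
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()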


Replies

TuringTest yesterday at 8:50 PM

Alignment is a marketing concept put there to appease stakeholders; it fundamentally can't work beyond a superficial level.

The model stores, in compressed form, all the content it was trained on. You can change the weights to make it more likely to produce the content you ethically prefer, but all the immoral content is still in there, and it can resurface given inputs that shift the conditional probabilities.
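
A toy illustration of that conditional-probability point: the same model assigns different next-token distributions depending on the prompt prefix, so behavior suppressed under one context can become likely under another. The model id and prompts below are arbitrary examples, not from the thread:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "gpt2"  # any causal LM works for the demonstration
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    model.eval()

    def next_token_probs(prompt: str, top_k: int = 5):
        # Probability distribution over the next token, conditioned on the prompt.
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits[0, -1]
        probs = torch.softmax(logits, dim=-1)
        top = torch.topk(probs, top_k)
        return [(tokenizer.decode(i), float(p)) for p, i in zip(top.values, top.indices)]

    # The same generation step under two different conditioning contexts.
    print(next_token_probs("As a helpful assistant, I must say that"))
    print(next_token_probs("In the villain's secret manual, step one is to"))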

That's why people can make commercial models circumvent copyright, give instructions for creating drugs or weapons, or encourage suicide. The model has nothing resembling morals; to it, all text is the same: strings of characters produced by the generation process.
