Alignment is a marketing concept put there to appease stakeholders; it fundamentally can't work...

TuringTest • last Saturday at 8:50 PM • 2 replies • view on HN

Alignment is a marketing concept put there to appease stakeholders; it fundamentally can't work at more than a superficial level.

The model stores all the content on which it is trained in a compressed form. You can change the weights to make it more likely to show the content you ethically prefer; but all the immoral content is also there, and it can resurface with inputs that change the conditional probabilities.
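To make the conditional-probability point concrete, here is a toy sketch, not a description of how any real model is trained. It assumes two made-up continuations with invented logits, models "alignment" as a fixed penalty on one of them, and models the prompt as a hypothetical `context_shift` added to the same logit. All names and numbers are illustrative assumptions.

```python
import math

def softmax(logits):
    """Turn a dict of logits into a dict of probabilities."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

# Hypothetical base logits: before any tuning, both continuations are equally likely.
BASE = {"preferred": 2.0, "dispreferred": 2.0}

# Assumption: tuning acts like a fixed penalty on the dispreferred continuation.
PENALTY = 4.0

def aligned_probs(context_shift):
    """Probabilities after the penalty, given how strongly the input
    pushes toward the dispreferred continuation (context_shift)."""
    logits = {
        "preferred": BASE["preferred"],
        "dispreferred": BASE["dispreferred"] - PENALTY + context_shift,
    }
    return softmax(logits)

neutral = aligned_probs(0.0)      # ordinary prompt
adversarial = aligned_probs(6.0)  # prompt crafted to shift the conditional

print(f"neutral prompt:     p(dispreferred) = {neutral['dispreferred']:.3f}")
print(f"adversarial prompt: p(dispreferred) = {adversarial['dispreferred']:.3f}")
```

In this toy setup the penalty makes the dispreferred continuation rare but never impossible, and a large enough shift from the input outweighs it. Whether real models behave this way at scale is exactly what the thread is arguing about.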

That's why people can get commercial models to circumvent copyright, give instructions for creating drugs or weapons, encourage suicide... The model does not have anything resembling morals; to it, all text is the same: strings of characters that appear when following the generation process.

Replies

pixl97 • last Saturday at 9:18 PM

>Alignment is a marketing concept put there to appease stakeholders

This is a pretty odd statement.

Let's take LLMs alone out of this statement and go with a GenAI-style guided humanoid robot. It has language models to interpret your instructions, vision models to interpret the world, and mechanical models to guide its movement.

If you tell this robot to take a knife and cut onions, alignment means it isn't going to take the knife and chop up your wife.

If you're a business, you want a model aligned not to give company secrets.

If it's a health model, you want it to not give dangerous information, like conflicting drugs that could kill a person.

Our LLMs interact with society, and their behaviors will fall under the social conventions of those societies. Much like humans, LLMs will still have the bad information, but we can greatly reduce the probability that they will show it.


idiotsecant • last Saturday at 9:06 PM

I'm not so sure about that. The incorrect answers to just about any given problem are in the problem set as well, but you can pretty reliably predict that the correct answer will be given, granted you have a statistical correlation in the training data. If your training data is sufficiently moral, the outputs will be as well.

