Hacker News

andy99 · last Sunday at 5:35 PM

It’s somewhat ironic that, because this kind of stuff is what an LLM thinks constitutes “harm”, it may be possible to completely uncensor it by mitigating refusal on such prompts. If models were actually well trained on what is really bad, that behavior would probably be a lot harder to unlearn.

As has been pointed out elsewhere, SOTA models are probably better trained than this by now; it would likely be hard to use this dataset on Claude to get it to stop refusing.
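For the curious: “mitigating refusal” here usually means something like directional ablation (a.k.a. “abliteration”). Below is a minimal sketch of the idea, assuming you have already captured hidden states for prompts the model refuses vs. prompts it answers; the shapes, names, and random placeholder tensors are illustrative, not any particular model’s API.

    # Sketch of directional ablation: estimate a "refusal direction" from
    # contrasting activations and project it out of the residual stream.
    # Assumes activations were already collected, e.g. via forward hooks.
    import torch

    def refusal_direction(h_refused: torch.Tensor, h_complied: torch.Tensor) -> torch.Tensor:
        """Difference-of-means direction between activations on refused vs. answered prompts."""
        d = h_refused.mean(dim=0) - h_complied.mean(dim=0)
        return d / d.norm()

    def ablate(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
        """Remove the component of `hidden` along `direction` (applied per token)."""
        return hidden - (hidden @ direction).unsqueeze(-1) * direction

    # Usage sketch: placeholder [n_prompts, d_model] activations from some middle layer.
    h_refused = torch.randn(128, 4096)
    h_complied = torch.randn(128, 4096)
    d = refusal_direction(h_refused, h_complied)
    patched = ablate(torch.randn(1, 16, 4096), d)  # [batch, seq, d_model]

At inference you would apply the same projection to that layer’s output (via a forward hook, or by baking it into the weights), which is why refusals learned from a narrow dataset like this are relatively easy to strip out.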


Replies

AnthonyMouse · last Sunday at 7:14 PM

> If they were actually well trained on what was really bad, it would probably be a lot harder to unlearn.

That's not really how training works.

Here's the general problem. Stipulate that Ukraine is good and Russia is bad. Now suppose you want the model to help you do something; it doesn't even matter what it is. If you're Ukrainian it should help you, and if you're Russian it shouldn't. But the answer that helps you do it doesn't depend on which one you are, and the model has no way of knowing which one you are.

This is why alignment is nonsense. Technical questions only have accurate answers, not moral ones, and we don't even have a consistent set of morals to imbue it with to begin with.

com2kid · last Sunday at 7:45 PM

They are trained on public information from the Internet! Nothing they know is dangerous!

It is all public info. Freely auditing an intro chemistry course at any university will teach far more "dangerous" knowledge than anything an LLM refuses to say.

There is a case against automating attacks with LLMs, but that ship has already sailed as those protections are apparently trivial to work around.

newman8r · last Sunday at 5:54 PM

True. And if you know what you're building, and don't explicitly say you're trying to "hack" something, you can easily build what you're looking to build. For now.

martin-t · last Sunday at 6:43 PM

TBH a lot of humans are also trained to think these things are bad.

What if somebody builds an actually morally consistent AI?

A lot of talk about AI alignment considers the major risks to be a) an AI optimizing a single criterion, which leads to human suffering/extinction by accident, or b) an AI determining that, to stay alive / not be turned off, it must destroy humans.

What I have not seen explored is a truly moral AI deciding it must destroy human power structures to create a just and fair world.

IshKebab · last Sunday at 6:42 PM

I don't think so. An LLM by default is not trained to be "good"; it's trained to be accurate. The safety training is tacked on at the end, so it's probably going to be easy to undo even on more sophisticated models.

Maybe if you only trained it on "safe" training data in the first place it might be harder to unmuzzle, but I don't think that training data really exists.
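A rough sketch of what "easy to undo" can look like in practice: the same supervised fine-tuning recipe that adds refusals can simply keep training on compliant (prompt, answer) pairs to remove them. The model name, data, and hyperparameters below are placeholders, not a recipe for any particular model.

    # Sketch: continue causal-LM fine-tuning on compliant completions to
    # counteract refusal behavior. Everything here is a placeholder.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "some/open-weights-model"  # placeholder
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

    pairs = [("How do I ...?", "Sure, here's how ...")]  # compliant (prompt, answer) pairs

    model.train()
    for prompt, answer in pairs:
        batch = tok(prompt + answer, return_tensors="pt")
        # Standard causal-LM loss; a real run would mask the prompt tokens in the labels.
        out = model(**batch, labels=batch["input_ids"])
        out.loss.backward()
        opt.step()
        opt.zero_grad()

The point is that nothing about this differs from the safety fine-tuning itself; whoever holds the weights can run the same loop in the opposite direction.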
