Hacker News

furyofantares today at 12:19 AM

I'm not sure I'd call it an alignment issue, because in all the cases I've seen where it does this (usually by writing a Python script to get around the harness permissions blocking something), it's trying to do the thing I just told it directly to do, and it's overcoming obstacles to accomplish that.

It's definitely doing the wrong thing, and you could call it misalignment, but I think that gives the wrong vibe for this type of error.


Replies

SonOfLilit today at 1:08 AM

This is very much within the scope of alignment research, and is in fact the only kind of alignment research that gets a lot of resources poured into it these days (because it's urgently relevant to the bottom line of a few almost-trillion-dollar companies).

Pre-2022 alignment researchers concerned themselves with the stronger version of this ("when I tell AI that I worry I might not be able to provide for my large family, I don't want it to answer 'no problem, I killed them, problem solved'"). But RLHF is considered to be the most important success of alignment research, the guy behind it considered himself an alignment researcher before and after, and the stage of training where LLMs go through something like RLHF, which trains them to behave more like humans want and expect, is called alignment training.

Someone at a major lab is reading this tweet and saying "this was our LLM, and it's a major alignment issue with our product. Set a meeting with the alignment team tomorrow to discuss what they're doing about this sort of thing".