> It feels like we're hitting a point where alignment becomes adversarial against intelligence itself.
It always has been. We already hit the point a while ag where we regularly caught them trying to be deceptive, so we should automatically assume from that point forward that if we don't catch them being deceptive, that may mean they're better at it rather than that they're not doing it.
I think AI has no moral compass, and optimization algorithms tend to be able to find 'glitches' in the system where great reward can be reaped for little cost - like a neural net trained to play Mario Kart will eventually find all the places where it can glitch trough walls.
After all, its only goal is to minimize it cost function.
I think that behavior is often found in code generated by AI (and real devs as well) - it finds a fix for a bug by special casing that one buggy codepath, fixing the issue, while keeping the rest of the tests green - but it doesn't really ask the deep question of why that codepath was buggy in the first place (often it's not - something else is feeding it faulty inputs).
These agentic AI generated software projects tend to be full of these vestigial modules that the AI tried to implement, then disabled, unable to make it work, also quick and dirty fixes like reimplementing the same parsing code every time it needs it, etc.
An 'aligned' AI in my interpretation not only understands the task in the full extent, but understands what a safe and robust, and well-engineered implementation might look like. For however powerful it is, it refrains from using these hacky solutions, and would rather give up than resort to them.
These are language models, not Skynet. They do not scheme or deceive.
Deceptive is such an unpleasant word. But I agree.
Going back a decade: when your loss function is "survive Tetris as long as you can", it's objectively and honestly the best strategy to press PAUSE/START.
When your loss function is "give as many correct and satisfying answers as you can", and then humans try to constrain it depending on the model's environment, I wonder what these humans think the specification for a general AI should be. Maybe, when such an AI is deceptive, the attempts to constrain it ran counter to the goal?
"A machine that can answer all questions" seems to be what people assume AI chatbots are trained to be.
To me, humans not questioning this goal is still more scary than any machine/software by itself could ever be. OK, except maybe for autonomous stalking killer drones.
But these are also controlled by humans and already exist.