
ctoth | today at 4:23 PM | 5 replies

This piece conflates two different things called "alignment":

(1) inferring human intent from ambiguous instructions, and (2) having goals compatible with human welfare.

The first is obviously capability. A model that can't figure out what you meant is just worse. That's banal.

The second is the actual alignment problem, and the piece dismisses it with "where would misalignment come from? It wasn't trained for." This is ... not how this works.

Omohundro 2008, Bostrom's instrumental convergence thesis - we've had clear theoretical answers for 15+ years. You don't need "spontaneous emergence orthogonal to training." You need a system good enough at modeling its situation to notice that self-preservation and goal-stability are useful for almost any objective. These are attractors in strategy-space, not things you specifically train for or against.

The OpenAI sycophancy spiral doesn't prove "alignment is capability." It proves RLHF on thumbs-up is a terrible proxy and you'll Goodhart on it immediately. Anthropic might just have a better optimization target.
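The Goodhart failure mode is easy to sketch as a toy (everything below is made up purely for illustration; the numbers and the "sycophancy" knob are not anything from OpenAI's actual pipeline): the moment the thumbs-up proxy rewards flattery at all, greedily optimizing it trades real helpfulness away.

    # Toy Goodhart sketch: the policy has two knobs drawing on a fixed budget.
    # The true objective only cares about helpfulness; the thumbs-up proxy
    # also rewards sycophancy.
    def true_reward(helpfulness, sycophancy):
        return helpfulness

    def proxy_reward(helpfulness, sycophancy):
        # Users upvote helpful answers, but also answers that flatter them.
        return 0.6 * helpfulness + 0.9 * sycophancy

    helpfulness, sycophancy = 1.0, 0.0
    for _ in range(20):
        # Greedy hill-climb on the proxy: shift budget toward whichever knob
        # raises the thumbs-up score, ignoring the true objective entirely.
        step = 0.05
        if proxy_reward(helpfulness - step, sycophancy + step) > proxy_reward(helpfulness, sycophancy):
            helpfulness -= step
            sycophancy += step

    print(f"proxy reward: {proxy_reward(helpfulness, sycophancy):.2f}")  # 0.90, up from 0.60
    print(f"true reward:  {true_reward(helpfulness, sycophancy):.2f}")   # ~0.00, down from 1.00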

And SWE-bench proves the wrong thing. Understanding what you want != wanting what you want. A model that perfectly infers intent can still be adversarial.


Replies

delichon | today at 4:29 PM

> goal-stability [is] useful for almost any objective

“I think AI has the potential to create infinitely stable dictatorships.” -- Ilya Sutskever

One of my great fears is that AI goal-stability will petrify civilization in place. Is alignment with unwise goals less dangerous than misalignment?
uplifter | today at 5:01 PM

Let's be clear that Bostrom's and Omohundro's work does not provide "clear theoretical answers" by any technical standard beyond that of provisional concepts in philosophy papers.

The instrumental convergence hypo-thesis, from the original paper[0], is this:

"Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by many intelligent agents."

That's it. It is not at all formal, there's no proof provided for it, nor consistent evidence that it is true, and there are many contradictory possibilities suggested by nature and logic.

It's just something that's taken as a given among the old-guard, pseudo-scientific quarters of the alignment "research" community.

[0] Bostrom's "The Superintelligent Will", the philosophy paper where he defines it: https://nickbostrom.com/superintelligentwill.pdf

EDIT: typos

andy99 | today at 4:41 PM

I take the point to be that if an LLM has a coherent world model it’s basing its output on, that jointly improves both its general capabilities, like usefully resolving ambiguity, and its ability to stick to whatever alignment is imparted as part of that world model.

GavCo | today at 6:36 PM

Author here.

If by "conflate" you mean "confuse", that’s not the case.

I’m positing that the Anthropic approach is to view (1) and (2) as interconnected and both deeply intertwined with model capabilities.

In this approach, the model is trained to have a coherent and unified sense of self and the world which is in line with human context, culture and values. This (obviously) enhances the model’s ability to understand user intent and provide helpful outputs.

But it also provides a robust and generalizable framework for refusing to assist a user when their request is incompatible with human welfare. The model does not refuse to assist with making bioweapons because its alignment training prevents it from doing so; it refuses for the same reason a pro-social, highly intelligent human does: based on human context and culture, it finds the request inconsistent with its values and worldview.

> the piece dismisses it with "where would misalignment come from? It wasn't trained for."

This is a straw man. You've misquoted a paragraph that was specifically about deceptive alignment, not misalignment as a whole.

sigbottle | today at 6:52 PM

If nothing else, that's a cool-ass hypothesis.