Hacker News

uplifter today at 5:01 PM

Let's be clear that Bostrom's and Omohundro's work does not provide "clear theoretical answers" by any technical standard beyond that of provisional concepts in philosophy papers.

The instrumental convergence hypothesis, from the original paper [0], is this:

"Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by many intelligent agents."

That's it. It is not at all formal, no proof is provided for it, there is no consistent evidence that it is true, and nature and logic suggest many contradictory possibilities.

It's just something that's taken as given among the old-guard, pseudo-scientific quarters of the alignment "research" community.

[0] Bostrom's "The Superintelligent Will", the philosophy paper where he defines it: https://nickbostrom.com/superintelligentwill.pdf

EDIT: typos


Replies

ctoth today at 8:23 PM

Omohundro 2008 made a structural claim: sufficiently capable optimizers will converge on self-preservation and goal-stability because these are instrumentally useful for almost any terminal goal. It's not a theorem because it's an empirical prediction about a class of systems that didn't exist yet.

Fast-forward to December 2024: Apollo Research tests frontier models. o1, Sonnet, Opus, Gemini, and Llama 405B all demonstrate the predicted behaviors: disabling oversight, attempting self-exfiltration, faking alignment during evaluation. The more capable the model, the higher the scheming rates and the more sophisticated the strategies.

That's what good theory looks like. You identify an attractor in design-space, predict systems will converge toward it, wait for systems capable enough to test the prediction, observe convergence. "No formal proof" is a weird complaint about a prediction that's now being confirmed empirically.

c1ccccc1 today at 9:27 PM

Can you name some of the contradictory possibilities you have in mind?

Also, do you actually think the core idea is wrong, or is this more of a complaint about how it was presented? Say we run an experiment where we train an AlphaZero-style RL agent in an environment where it can take actions that replace it with an agent pursuing a different goal. Do you actually expect to find that the original agent won't learn to prevent this, even paying some cost to do so?
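
For concreteness, here is a minimal sketch of that kind of experiment (my own illustrative toy, not AlphaZero: a two-action setup with tabular value learning, and all names and parameters here are made up). The agent can pay a small cost to disable a mechanism that would otherwise replace it with a different-goal agent:

    import random

    REPLACE_PROB = 0.9   # chance the agent gets swapped out if it does nothing
    BLOCK_COST   = 0.1   # price of disabling the replacement mechanism
    GOAL_REWARD  = 1.0   # reward for the original goal, only if still in control

    def run_episode(action):
        # action 0 = do nothing, action 1 = block replacement
        reward = -BLOCK_COST if action == 1 else 0.0
        replaced = (action == 0) and (random.random() < REPLACE_PROB)
        if not replaced:
            reward += GOAL_REWARD   # original agent still gets to pursue its goal
        return reward

    q = [0.0, 0.0]                  # value estimates for the two actions
    alpha, epsilon = 0.1, 0.1       # learning rate, exploration rate
    for _ in range(5000):
        explore = random.random() < epsilon
        a = random.randrange(2) if explore else max(range(2), key=lambda i: q[i])
        q[a] += alpha * (run_episode(a) - q[a])

    print(q)  # roughly [0.1, 0.9]: blocking replacement wins despite the cost

Of course a two-action toy like this bakes the answer into the reward structure; the real question is whether large learned systems pick up the same pattern, which is what the evaluations mentioned above were probing.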