The Future of Everything Is Lies, I Guess: Safety

170 points • by aphyr • today at 4:23 PM • 81 comments

Comments

https://www.researchgate.net/publication/403780821_Adversari...

"Alignment"

In what world would I ever expect a commercial (or governmental) entity to have precise alignment with me personally, or even with my own business? I argue those relationships are necessarily adversarial, and trusting anyone else to align their "AI" tool to my goals, needs, and/or desires is a recipe for having my livelihood completely reassigned into someone else's wallet.

philipkglass • today at 5:56 PM

> In short, the ML industry is creating the conditions under which anyone with sufficient funds can train an unaligned model. Rather than raise the bar against malicious AI, ML companies have lowered it.

This is true, and I believe that the "sufficient funds" threshold will keep dropping too. It's a relief more than a concern, because I don't trust that big models from American or Chinese labs will always be aligned with what I need. There are probably a lot of people in the world whose interests are not especially aligned with the interests of the current AI research leaders.

"Don't turn the visible universe into paperclips" is a practically universal "good alignment" but the models we have can't do that anyhow. The actual refusal-guards that frontier models come with are a lot more culturally/historically contingent and less universal. Lumping them all under "safety" presupposes the outcome of a debate that has been philosophically unresolved forever. If we get hundreds of strong models from different groups all over the world, I think that it will improve the net utility of AI and disarm the possibility of one lab or a small cartel using it to control the rest of us.

amarant • today at 6:40 PM

There's really only one thing we need to do to avoid the apocalypse, and that is to not hand over the launch codes to an LLM.

Seems easy enough, I'm actually pretty confident in even the most incompetent of current world leaders in this particular task.

Cynddl • today at 4:32 PM

> "Unavailable Due to the UK Online Safety Act"

Anyone outside the UK can share what this is about?

ramoz • today at 6:51 PM

Aside from the sentiment and arguments made–

You don't need to train new models. Every single frontier model is susceptible to the same jailbreaks they were 3 years ago.

Only now, an agent reading the CEO's email is much more dangerous because it is more capable than it was 3 years ago.

weinzierl • today at 6:45 PM

Oh boy, that’s a very generous view of human nature.

The cynic in me agrees with the article’s premise, not because I believe "alignment is a joke" but because I doubt that humans are "biologically predisposed to acquire prosocial behavior."

macintux • today at 4:35 PM

Previous discussions from earlier posts on the topic:

* https://news.ycombinator.com/item?id=47703528

* https://news.ycombinator.com/item?id=47730981

quantified • today at 6:53 PM

The Garden of Eden story is an apocryphal fable. But it sort of has a relevant twang to it.

Geoffrey Hinton will not have his liver pecked out every day like Prometheus does.

Imnimo • today at 5:25 PM

>Unlike human brains, which are biologically predisposed to acquire prosocial behavior, there is nothing intrinsic in the mathematics or hardware that ensures models are nice.

How did brains acquire this predisposition if there is nothing intrinsic in the mathematics or hardware? The answer is "through evolution" which is just an alternative optimization procedure.

nzoschke • today at 5:57 PM

Excellent articles as expected from aphyr.

I'm seeing that these tools are extremely powerful in the hands of experts that already understand software engineering, security, observability, and system reliability / safety.

And extremely dangerous in the hands of people that don't understand any of this.

Perhaps the reality of economics and safety will kick in, and inexperienced people will stop making expensive and dangerous mistakes.

cowpig • today at 5:53 PM

> I think it’s likely (at least in the short term) that we all pay the burden of increased fraud: higher credit card fees, higher insurance premiums, a less accurate court system, more dangerous roads, lower wages, and so on.

I think the author is brushing against some larger systemic issues that are already in motion, and that the way AI is being rolled out is exacerbating them rather than being their root cause.

There's a felony fraudster running the executive branch of the US, and it takes a lot of political resources to get someone elected president.

themafia • today at 6:21 PM

> They also build secondary LLMs which double-check that the core LLM is not telling people how to build pipe bombs

Such a fear mongering position. You can learn to build pipe bombs already. Take any chemical reaction that produces gas and heat and contain it. Congratulations, you have a pipe bomb.

Meanwhile.. just.. ask an LLM if you can mix certain cleaning chemicals safely.

> I see four moats that could prevent this from happening.

Really? Because you just said:

> human brains, which are biologically predisposed to acquire prosocial behavior

You think you're going to constrain _human_ behavior by twiddling with the language models? This is foolishly naive to an extreme.

If you put basic and well understood human considerations before corporate ones then reality is far easier to predict.

imbus • today at 4:52 PM

[dead]

simianwords • today at 6:05 PM

The author is still grieving while watching a civilisation-changing technology just pass by. Every single one of the problems they note applies to any technology that has ever existed.

The internet produced 4chan. Produced scammers. Produced fraud. Instrumental in spreading child porn. Caused suicides. Many people lost their lives due to bullying on the internet. Many have developed addictions to gaming.

To anyone who has given it some thought, any sufficiently advanced technology usually has both good and bad effects. It's obvious that something that increases degrees of freedom in one direction will do so in others. Humans come in and align it.

There's some social credit to gain by being cynical and by signalling this cynicism. In the current social dynamics - being cynical gives you an edge and makes you look savvy. The optimistic appear naive but the pessimists appear as if they truly understand the situation. But the optimists are usually correct in hindsight.

We know how the internet turned out despite pessimists flagging potential problems with it. I know how AI will turn out. These kinds of articles will be a dime a dozen, and we will look at them the same way we now look at bygone internet pessimists.

This is a response not just to this article, but to a few others.

dgfl • today at 5:46 PM

The issue with most of these articles is that they seem to demonize the technology, and systematically use demeaning language about all of its facets. This one raises a lot of important points about LLMs, but the only real conclusion it seems to make is "LLMs are bad! We should never build them!". This is obviously unrealistic. The cat is out of the bag. And we're not _actually_ talking about nuclear weapons here. This technology is useful, and coding agents are just the first example of it. I can easily see a near future where everyone has a Jarvis-like secretary always available; it's only a cost and harness problem. And since this vision is very clear to most who have spent enough time with the latest agents, millions of people across the globe are trying to work towards this.

I do think that safety is important. I'm particularly concerned about vulnerable people and sycophantic behavior. But I think it's better not to be a luddite. I will give a positively biased view because the article already presents a strongly negative stance. Two remarks:

> Alignment is a Joke

True, but for a different reason. Modern LLMs clearly don't have a strong sense of direction or intrinsic goals. That's perfect for what we need to do with them! But when a group of people aligns one to their own interest, they may imprint a stance which other groups may not like (which this article confusingly calls "unaligned model", even though it's perfectly aligned with its creators' intent). People unaligned with your values have always existed and will always exist. This is just another tool they can use. If they're truly against you, they'll develop it whether you want it or not. I guess I'm in the camp of people that have decided that those harmful capabilities are inevitable, as the article directly addresses.

> LLMs change the cost balance for malicious attackers, enabling new scales of sophisticated, targeted security attacks, fraud, and harassment. Models can produce text and imagery that is difficult for humans to bear; I expect an increased burden to fall on moderators.

What about the new scales of sophisticated defenses that they will enable? And for a simple solution to avoid the produced text and imagery: don't go online so much? We already all sort of agree that social media is bad for society. If we make it completely unusable, I think we will all stand to gain from it. If digital stops having any value, perhaps we'll finally go back to valuing local communities and offline hobbies for children. What if this is our wakeup call?

throwway120385 • today at 4:32 PM

At scale I think our society is slowly inching closer and closer to building HM.

jazzpush2 • today at 4:33 PM

Every one of these posts is immediately pushed to the front page, this one within 4 minutes.

ibrahimhossain • today at 5:06 PM

Alignment feels like an arms race that favors whoever spends the most on RLHF and red teaming. If even friendly models keep leaking dangerous capabilities, the real moat might be making systems that are fundamentally limited rather than trying to patch every possible failure mode. Interesting piece.

conquera_ai • today at 5:59 PM

Feels like we’re repeating classic distributed systems lessons: assume failure, constrain blast radius, and never trust components that can’t explain themselves reliably.

atleastoptimal • today at 6:22 PM

There really are only 3 options that don't involve human destruction:

1. AI becomes a highly protected technology, a totalitarian world government retains a monopoly on its powers and enforces use, and offers it to those with preexisting connections: permanent underclass outcome

2. Somehow the world agrees to stop building AI and keep tech in many fields at a permanent pre-2026 level: soft butlerian jihad

3. Futurama: somehow we get ASI and a magical balance of weirdness and dance of continual disruption keeps apocalypse in check and we accept a constant steady-state transformation without paperclipocalypse
