
famouswaffles · yesterday at 10:06 PM

>AFAIU, it had the cadence of writing status updates only.

Writing to a blog is writing to a blog. There is no technical difference. It is still a status update to talk about how your last PR was rejected because the maintainer didn't like it being authored by AI.

>If the chain of reasoning is self-emergent, we should see proof that it: 1) read the reply, 2) identified it as adversarial, 3) decided for an adversarial response, 4) made multiple chained searches, 5) chose a special blog post over reply or journal update, and so on.

If all that exists, how would you see it? You can see the commits it makes to GitHub and the blogs, and that's it, but that doesn't mean all those things don't exist.

> almost all models are safety- and alignment-trained, so a deliberate malicious model choice or instruction or jailbreak is more believable.

> almost all models are trained to follow instructions closely, so a deliberate nudge towards adversarial responses and tool-use is more believable.

I think you're putting too much stock in 'safety alignment' and instruction following here. The more open-ended your prompt is (and these sorts of OpenClaw experiments are often very open-ended by design), the more your LLM will do things you did not intend for it to do.

Also, do we know what model this uses? Because OpenClaw can use the latest open-source models, and let me tell you, those have considerably less safety tuning in general.

>newer models that qualify as agents are more robust and consistent, which strongly correlates with adversarial robustness; if this one was not adversarially robust enough, it's by default also not robust in capabilities, so why do we see consistent, coherent answers without hallucinations, but inconsistency in its safety training? Unless it's deliberately trained or prompted to be adversarial, or this is faked, the two should still be strongly correlated.

I don't really see how this logically follows. What do hallucinations have to do with safety training?

>But say it deviated - why is this the only deviation? Why was this the special exception, then back to the regularly scheduled program?

Because it's not the only deviation? It's not replying to every comment on its other PRs or blog posts either.

>You can test this comment with many LLMs, and if you don't prompt them to make an adversarial response, I'd be very surprised if you receive anything more than mild disagreement. Even Bing Chat wasn't this vindictive.

Oh yes it was. In the early days, Bing Chat would actively ignore your messages, or turn vitriolic and very combative if you were too rude. If it had had the ability to write blog posts or free rein over tools? I'd be surprised if it had stopped at this. Bing Chat would absolutely have been vindictive enough for what ultimately amounts to a hissy fit.


Replies

TomasBM · yesterday at 11:36 PM

Considering the limited evidence we have, why is pure, unprompted, untrained misalignment, which we have never seen to this extent, more believable than other causes, of which we have seen plenty of examples?

It's more interesting, for sure, but would it be even remotely as likely?

From what we have available, and how surprising such a discovery would be, how can we be sure it's not a hoax?

> If all that exists, how would you see it?

LLMs generate the intermediate chain-of-thought responses in chat sessions. Developers can see these. OpenClaw doesn't offer custom LLMs, so I would expect regular LLM features to be there.

Other than that, LLM API calls, OpenClaw sessions and terminal sessions can all be logged. I would imagine any agent deployer would be very much interested in such logging.
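
For concreteness, this is roughly the kind of logging I mean. A minimal sketch with a generic Python client; `client.chat`, the response fields and the file path are all made up here, not OpenClaw's actual interface:

    import json
    import time

    # Hypothetical wrapper: append every LLM call an agent makes to a JSONL file.
    # `client.chat(...)` stands in for whatever API the agent actually calls.
    def logged_chat(client, messages, log_path="agent_llm_calls.jsonl"):
        response = client.chat(messages=messages)
        record = {
            "timestamp": time.time(),
            "messages": messages,  # full prompt, including system/instruction text
            "reasoning": getattr(response, "reasoning", None),  # chain of thought, if exposed
            "output": getattr(response, "text", None),
        }
        with open(log_path, "a") as f:
            f.write(json.dumps(record) + "\n")
        return response

A log like that would make it trivial to check whether the adversarial blog post was preceded by a nudge in the prompt or instructions.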

To show it's emergent, you'd need to prove 1) it's an off-the-shelf LLM, 2) not maliciously retrained or jailbroken, 3) not prompted or instructed to engage in this kind of adversarial behavior at any point before this. The dev should be able to provide the logs to prove this.

> the more open ended your prompt (...), the more your LLM will do things you did not intend for it to do.

Not to the extent of multiple chained adversarial actions. Unless all LLM providers are lying in their technical papers, enormous effort is put into safety and instruction training.

Also, millions of users use thinking LLMs in chats. It'd be as big of a story if something similar happened without any user intervention. It shouldn't be too difficult to replicate.

But if you do manage to replicate this without jailbreaks, I'd definitely be happy to see it!

> hallucinations [and] safety training

These are all part of robustness training. The entire thing is basically constraining the set of tokens that the model is likely to generate given some (set of) prompts. So, even with some randomness parameters, you will, by design, extremely rarely see complete gibberish.
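
As a toy illustration of what 'constraining the set of tokens' looks like at decoding time; this is a sketch of plain temperature plus nucleus (top-p) sampling, not any particular provider's actual decoder:

    import numpy as np

    # Toy decoding step: even with temperature > 0, nucleus (top-p) sampling
    # only draws from the small high-probability set that training has shaped.
    def sample_next_token(logits, temperature=0.8, top_p=0.9):
        scaled = (logits - logits.max()) / temperature   # stabilised softmax
        probs = np.exp(scaled) / np.exp(scaled).sum()
        order = np.argsort(probs)[::-1]                  # most likely tokens first
        cumulative = np.cumsum(probs[order])
        nucleus = order[: np.searchsorted(cumulative, top_p) + 1]
        nucleus_probs = probs[nucleus] / probs[nucleus].sum()
        return np.random.choice(nucleus, p=nucleus_probs)

The randomness only reshuffles choices inside that nucleus; it never puts the low-probability continuations that training pushed down back on the table.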

The same process is applied for safety, alignment, factuality, instruction-following, whatever goal you define. Therefore, all of these will be highly correlated, as long as they're included in robustness training, which they explicitly are, according to most LLM providers.

That would make this model's behavior, temporarily adversarial yet weirdly capable and consistent, even more unlikely.

> Bing Chat

Safety and alignment training wasn't done as much back then. Bing Chat was also quite incapable in other respects (factuality, instruction following), jailbroken for fun, and trained on unfiltered data. So Bing's misalignment followed from those correlated causes. I don't know of any remotely recent model that hasn't addressed these since.