> It used a mix of dom-to-image sending pixels through the context window, then writing scripts in various sandboxes to piece together a full jailbreak.
That would be one interesting write-up if you ever find the time to gather all the details!
It's on my claw list to write a blog post. I just keep taking down my claws to make modifications. lol
Here are the full (unedited) details, including many of the Claude Code debugging sessions spent digging through the logs to figure out what happened:
https://github.com/simple10/openclaw-stack/blob/caf9de2f1c0c...
And here's a summary a friend did on a fork of my project:
https://github.com/proclawbot/openclaude/blob/caf9de2f1c0c54...
The full version has all the build artifacts Opus created to perform the jailbreak.
It also has some thoughts on how this could (and will) be used for pwn'ing OpenClaws.
The key takeaway: OpenClaw's default setup has little to no guardrails. It's just a huge list of tools handed to an LLM (Opus) along with a user request. What's particularly interesting is that the 130 tool calls never once triggered any of Opus's safety precautions. From its perspective, it was just given a task, an unlimited budget, and a bunch of tools to try to get the job done. It effectively runs in ralph mode.
So any prompt injection (e.g. from an ingested email or Reddit post) can quickly lead to internal data exfiltration. If you run a claw without good guardrails and observability, you're effectively creating a massive attack surface and handing attackers all the compute and API token funding they need to hack you. This is pretty much the pain point NemoClaw is trying to address. But it's a tricky tradeoff.
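To make "guardrails & observability" a bit more concrete, here's a rough sketch of the kind of gate you could put in front of every tool call: an allowlist, a hard call budget, a block on outbound tools once untrusted content is in context, and an audit log of every decision. This is not OpenClaw's or NemoClaw's actual API. `ToolGate`, `NETWORK_TOOLS`, and the tool names are all made up for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical names for tools that can send data off the box.
NETWORK_TOOLS = {"http_post", "send_email"}


class ToolCallDenied(Exception):
    """Raised when the gate refuses a tool call."""


@dataclass
class ToolGate:
    allowed_tools: set[str]   # tools the agent may use at all
    max_calls: int            # hard budget, so the agent cannot loop forever
    calls_made: int = 0
    audit_log: list[dict] = field(default_factory=list)

    def check(self, tool: str, args: dict, untrusted_context: bool = False) -> None:
        """Allow or deny one tool call and record the decision.

        untrusted_context should be True once the agent has ingested
        content it did not get from the user (email, web page, etc.).
        """
        reason = None
        if self.calls_made >= self.max_calls:
            reason = "call budget exhausted"
        elif tool not in self.allowed_tools:
            reason = "tool not on allowlist"
        elif untrusted_context and tool in NETWORK_TOOLS:
            reason = "outbound tool blocked after untrusted input"

        # Log every decision, allowed or not, so there is a trail to review.
        self.audit_log.append(
            {"tool": tool, "args": args, "allowed": reason is None, "reason": reason}
        )
        if reason is not None:
            raise ToolCallDenied(reason)
        self.calls_made += 1
```

You'd call `gate.check(...)` before dispatching each tool call and treat `ToolCallDenied` as a stop signal. The tradeoff shows up right away: the tighter the allowlist and budget, the less the claw can actually do on its own.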