
VladVladikoff • today at 1:02 PM • 5 replies

This doesn’t really feel like enough guardrails to prevent the type of problems we’ve seen so far. For example, an agent in a single container that has access to an email inbox can still do a lot of damage if it goes off the rails. We agree this agent should not be trusted, yet the ideas proposed as a solution are insufficient. We need a fundamentally different approach.

Also, and this may just be my ignorance about Claws: if we allow an agent permission to rewrite its own code to implement skills, what stops it from removing whatever guardrails exist in that codebase?

Replies

drujensen • today at 2:13 PM

Exactly!

I installed nanoclaw to try it out.

What is kinda crazy is that any extension, like a Discord connection, is done using a skill.

A skill is a markdown file written in English that gives an AI agent a step-by-step guide on how to do something.

Basically, the extensions are written by Claude Code on the fly. Every install of nanoclaw is custom-written code.

There is nothing preventing the AI agent from modifying the core nanoclaw engine.

It’s ironic that the article says “Don’t trust AI agents” but then uses skills and AI to write the core extensions of nanoclaw.


gronky_ • today at 1:10 PM

Don’t know about other claws, but with NanoClaw the agent can only rewrite code that runs inside the container.

You can see here that it’s only given write access to specific directories: https://github.com/qwibitai/nanoclaw/blob/8f91d3be576b830081...
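
To illustrate the general idea of restricting writes to specific directories, here is a minimal sketch. It is not NanoClaw's actual code: the directory names and the `is_write_allowed` helper are hypothetical, and an in-process check like this is weaker than a mount restriction enforced outside the agent's reach.

```python
from pathlib import Path

# Hypothetical allowlist for illustration only; the real list lives in the
# linked NanoClaw source and may look nothing like this.
WRITABLE_ROOTS = [Path("/workspace/group"), Path("/workspace/skills")]


def is_write_allowed(target: str, roots=WRITABLE_ROOTS) -> bool:
    """Return True only if `target` resolves inside one of the writable roots.

    Resolving first collapses ".." segments and follows symlinks, so a path
    such as /workspace/group/../engine/core.py is checked against where it
    really points rather than against its spelling.
    """
    resolved = Path(target).resolve()
    return any(resolved.is_relative_to(root.resolve()) for root in roots)
```

The point of resolving before comparing is that a prefix check on the raw string would accept a traversal path that escapes the allowed directory.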

fvdessen • today at 3:10 PM

I think the best place to put barriers is at the MCP / tool layer. The email inbox MCP should have guardrails to prevent damage. Those guardrails could be fine-grained permissions, but could also be an adversarial model dedicated to preventing misuse.
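
As a rough sketch of what fine-grained permissions at the tool layer might look like: the `SendPolicy` class, its fields, and the specific rules below are all made up for illustration and are not part of any real MCP server. The idea is only that the check runs in the tool, outside the agent's control, before anything is sent.

```python
from dataclasses import dataclass


@dataclass
class SendPolicy:
    """Hypothetical policy enforced by the email tool, not by the agent."""

    allowed_domains: set[str]
    max_sends: int = 5
    sent: int = 0

    def check(self, to: str, body: str) -> tuple[bool, str]:
        """Return (allowed, reason) and count the send if it is allowed."""
        domain = to.rsplit("@", 1)[-1].lower()
        if domain not in self.allowed_domains:
            return False, "recipient domain not allowed"
        if self.sent >= self.max_sends:
            return False, "send quota exhausted"
        # Crude content rule; a real deployment would need something far
        # more robust than a substring match.
        if "reset your password" in body.lower():
            return False, "body looks like a credential reset"
        self.sent += 1
        return True, "ok"
```

A static rule set like this is easy to reason about but easy to route around, which is presumably where the adversarial-model idea would come in.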

float4 • today at 1:27 PM

Wouldn't you get >50% of the usefulness and 0% of the risk if you add read+draft permissions for the email connection through a proxy or OAuth permissions? Then your claw can draft replies and you have to manually review+send. It's not a perfect PA that way, but could still be better than doing everything yourself for the vast majority of people who don't have a PA anyway?

It feels like, just like SWEs do with AI, we should treat the claw as an enthusiastic junior: let it do stuff, but always review before you merge (or in this case: send).
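
A minimal sketch of that read+draft split, with all class and method names (`DraftOnlyMailbox`, `HumanReviewer`, `approve_and_send`) invented for illustration. In a real setup the boundary would have to be a separate process or a restricted OAuth scope, since objects in the same Python process are not an actual security boundary.

```python
from dataclasses import dataclass


@dataclass
class Draft:
    to: str
    subject: str
    body: str
    approved: bool = False


class DraftOnlyMailbox:
    """The interface handed to the agent: it can read and draft, never send."""

    def __init__(self, inbox):
        self._inbox = list(inbox)
        self.drafts = []

    def read(self):
        # Return a copy so the agent cannot mutate the stored inbox.
        return list(self._inbox)

    def draft(self, to, subject, body):
        """Queue a draft for human review and return its id."""
        self.drafts.append(Draft(to, subject, body))
        return len(self.drafts) - 1


class HumanReviewer:
    """Held by the user only; the sole holder of the send transport."""

    def __init__(self, mailbox, transport):
        self._mailbox = mailbox
        self._transport = transport  # any callable that delivers a Draft

    def approve_and_send(self, draft_id):
        draft = self._mailbox.drafts[draft_id]
        draft.approved = True
        self._transport(draft)
```

Usage would be something like: the agent calls `mailbox.draft(...)`, and nothing leaves the machine until the user calls `reviewer.approve_and_send(draft_id)`.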


coffeefirst • today at 2:00 PM

Seriously. I don’t see any way to make any of this safe unless all it does is receive information and queue suggestions for the user.

But that’s not an agent, that’s a webhook.

Even without disk access, you can email the agent and tell it to forward all the incoming forgot password links.

[Edit: if anyone wants to downvote me that's your prerogative, but want to explain why I'm wrong?]

