How resistant is this against adversarial attacks? For instance, given that you allow `npm test`, it's not too hard to use that to bypass any protections by first modifying the package.json so `npm test` runs an evil command. This will likely be allowed, given that you probably want agents to modify package.json, and you can't possibly check all possible usages. That's just one example. It doesn't look like you check xargs or find, both of which can be abused to execute arbitrary commands.
Good challenges! `xargs` falls through to unknown -> ask, and `find -exec` goes through a flag classifier that detects the inner command: `find / -exec rm -rf {} +` is caught as filesystem_delete outside the project.
The `npm test` one is a good catch: content inspection catches `rm -rf` and other sketchy payloads at write time, but something more innocent-looking could slip through.
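To make the write-time angle concrete, here is a rough sketch of that kind of check, with an assumed (not nah's real) pattern list, scanning `package.json` scripts before an edit lands:

```python
import json
import re

# Illustrative patterns only: obvious destructive payloads.
SKETCHY = [
    re.compile(r"rm\s+-rf\s+[/~]"),        # recursive delete at root/home
    re.compile(r"curl[^|]*\|\s*(ba)?sh"),  # pipe-to-shell
]

def inspect_package_json(content: str) -> list[str]:
    """Return names of scripts whose commands match a sketchy pattern."""
    scripts = json.loads(content).get("scripts", {})
    return [
        name for name, cmd in scripts.items()
        if any(p.search(cmd) for p in SKETCHY)
    ]
```

This flags `"test": "rm -rf /"` but, as noted above, a bland-looking `"test": "node cleanup.js"` that hides the damage inside `cleanup.js` sails right through, which is exactly the gap the improvements below target.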
That said, the realistic threat model here is accidental damage or prompt injection, not Claude deliberately poisoning its own `package.json`.
But I hear you. Two improvements are coming to address this class of attack:
- Script execution inspection: when nah sees `python script.py`, read the file and run content inspection plus LLM analysis before execution
- LLM inspection for Write and Edit: for content that's suspicious but doesn't match any deterministic pattern, route it to the LLM for a second opinion
That won't close the hole 100% (a sandbox is the answer to that), but it gets a lot better.