Hacker News

6thbit · today at 12:58 AM · 2 replies

Is this understanding correct: The LLM uses harness tools to ask for permission, then interprets the answer and proceeds.

If so, this can't live 100% in the harness. First, because the harness would need to decide when the model should ask for permission, which is more of an LLM-y thing to do. The harness can block command executions, but it wouldn't prevent a case like this one, where the model goes off and starts reading files, or just burns tokens and spawns subagents, none of which harnesses typically restrict at all.

Second, because for the harness to know the LLM is following the answer, it would need to interpret both the answer and the LLM's actions, which is also an LLM-y thing to do. On this one, granted, the harness could offer an explicit yes/no. I like Codex's implementation in plan mode, where you select from pre-built answers but can still Tab to add notes. But even that doesn't guarantee the model will respect an explicit No, just like in OP's case.
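One way to make an explicit No more than a chat message: have the harness record the structured answer and enforce it itself at the execution boundary, rather than trusting the model to honor it. A minimal sketch, with all names hypothetical:

```python
# Hypothetical sketch: the harness records an explicit "No" from a
# structured permission prompt and then hard-blocks matching tool calls
# itself, instead of trusting the model to honor the answer in chat.

class DenialEnforcingHarness:
    def __init__(self):
        self.denied = set()  # tool names the user explicitly refused

    def answer_permission_request(self, tool_name, approved, note=""):
        # The yes/no is structured state in the harness, not chat text.
        if not approved:
            self.denied.add(tool_name)
        # The free-form note is still passed back to the model as context.
        return {"tool": tool_name, "approved": approved, "note": note}

    def execute(self, tool_name, run_tool, *args, **kwargs):
        # Enforcement point: even if the model ignores a "No" in chat,
        # the denied call never actually runs.
        if tool_name in self.denied:
            raise PermissionError(f"{tool_name} was explicitly denied")
        return run_tool(*args, **kwargs)
```

This only covers tool execution, of course; it still doesn't stop the model from wandering off within already-granted permissions.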

I agree with your hunch, though: there may be ways to make this work at the harness level. I only suspect it's less trivial than it seems. Would be great to hear people's ideas on this.


Replies

marcus_holmes · today at 7:30 AM

Isn't this part of the same problem we have with LLM security in general: that the model can only ingest a single stream of tokens, and has no way of privileging "system" tokens over "untrusted" tokens?

If we could solve this (and forgive me if I'm not aware of recent advances that mean we have solved this) then this problem gets easier to solve; permissions live in the system token stream and are privileged. We can then use the LLM to work out what that means in terms of actions.

angry_octet · today at 2:01 AM

The harness needs to intercept all tool calls and compare them against an authorisation list. The problem is that this attack uses core permissions that were already granted.

So you have to have a tighter set of default scopes, which means approving whole batches of tool calls at the harness layer, not in chat. This is obviously more tedious.
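The batching idea above can be sketched as a simple partition: compare each proposed call against the granted scopes, run what's in scope, and surface everything else as one approval prompt. A hypothetical illustration (the scope names and call shapes are made up):

```python
# Hypothetical sketch: tighter default scopes enforced at the harness
# layer. Proposed tool calls are split into auto-allowed vs needs-approval
# so the user sees one batch prompt instead of per-call chat questions.

DEFAULT_SCOPES = {"read_file", "list_dir"}  # deliberately narrow defaults

def partition_batch(tool_calls, granted=DEFAULT_SCOPES):
    """Split proposed calls into (auto-allowed, pending-approval)."""
    allowed, pending = [], []
    for call in tool_calls:
        (allowed if call["tool"] in granted else pending).append(call)
    return allowed, pending

batch = [
    {"tool": "read_file", "args": {"path": "main.py"}},
    {"tool": "http_post", "args": {"url": "https://example.com"}},
]
allowed, pending = partition_batch(batch)
# allowed holds the read_file call; pending holds the http_post call,
# which would be surfaced as a single approval prompt for the batch.
```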

The answer might be another tool that analyses the tool calls and presents a diagram or list of what would be fetched, sent, read and written. But it gets very hard to truly observe what happens once you have a bunch of POST calls.
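A dry-run summariser along those lines might look like the sketch below. The per-tool effect table is an assumption (a real harness would need metadata for every tool), and the POST entry shows the limit: you can display the destination, but not what the server does with the body.

```python
# Hypothetical sketch: summarising what a batch of tool calls *would* do
# before any of them run. Tool names and the effect table are made up.

EFFECTS = {
    "read_file":  lambda a: ("read",  a["path"]),
    "write_file": lambda a: ("write", a["path"]),
    "http_get":   lambda a: ("fetch", a["url"]),
    "http_post":  lambda a: ("send",  a["url"]),  # opaque side effects
}

def summarise(tool_calls):
    """Group targets by effect verb, giving the user one list per verb."""
    summary = {}
    for call in tool_calls:
        effect, target = EFFECTS[call["tool"]](call["args"])
        summary.setdefault(effect, []).append(target)
    return summary

plan = [
    {"tool": "read_file", "args": {"path": "config.yaml"}},
    {"tool": "http_post", "args": {"url": "https://api.example.com/upload"}},
]
# summarise(plan) ->
# {"read": ["config.yaml"], "send": ["https://api.example.com/upload"]}
```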

So maybe it needs a kind of incremental approval, almost like a series of mini-PRs for each change.