from my understanding, you can run the inference server (llama.cpp/vllm/whatever) and the agent/harness in different contexts, event different machines.
The risky part is in the agent/harness and what tools it has access to.
You don't need to give GPU passthrough to the VM running the agent/harness.
There is still a risk of a prompt messing with the inference server, but I think that's a much lower risk compared to an agent doing whatever on its own.
from my understanding, you can run the inference server (llama.cpp/vllm/whatever) and the agent/harness in different contexts, event different machines.
The risky part is in the agent/harness and what tools it has access to.
You don't need to give GPU passthrough to the VM running the agent/harness.
There is still a risk of a prompt messing with the inference server, but I think that's a much lower risk compared to an agent doing whatever on its own.