logoalt Hacker News

SwellJoetoday at 7:48 AM0 repliesview on HN

Also, with regard to tools, I originally ran a batch of several models in a full-featured agent (and whatever tools the agent provides), and they didn't perform better than the basic minimal harness with just read and grep. They chewed more tokens but didn't find more bugs. I'm currently doing tests with more advanced tools, like tree-sitter so the model can better understand execution and data flow and semgrep (which is almost cheating, since it finds bugs on its own, but worth a try since models can still be useful in helping rule out false positives and suggest mitigations). When I've got time for it, I'll also give them a full dev environment with compiler, debugger, and maybe fuzzer, and a loop that iterates through a security bug hunting checklist (since a single prompt and context window can't handle that much complexity at once).