Hacker News

Comparing AI agents to cybersecurity professionals in real-world pen testing

116 points by littlexsparkee last Tuesday at 9:23 PM | 84 comments

WSJ writeup ("AI Hackers Are Coming Dangerously Close to Beating Humans"): https://www.wsj.com/tech/ai/ai-hackers-are-coming-dangerousl..., https://archive.ph/L4gh3


Comments

tptacek last Tuesday at 9:42 PM

It's way too early to make firm predictions here, but if you're not already in the field it's helpful to know there's been 20 years of effort at automating "pen-testing", and the specific subset of testing this project focused on (network pentesting --- as opposed to app pentesting, which targets specifically identified network applications) is already essentially fully automated.

I would expect over the medium term agent platforms to trounce un-augmented human testing teams in basically all the "routinized" pentesting tasks --- network, web, mobile, source code reviews. There are too many aspects of the work that are just perfect fits for agent loops.

show 3 replies
KurSix yesterday at 8:30 AM

Note that GPT-5 in a standard scaffold (Codex) lost to almost everyone, while in the ARTEMIS scaffold it won. The key isn't the model itself but the Triage Module and Sub-agents. Splitting roles into "Supervisor" (manager) and "Worker" (executor) with intermediate validation is the only viable pattern for complex tasks. This is a blueprint for any AI agent, not just in cybersec.
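For anyone who hasn't built one of these, here's the pattern in miniature (a sketch with made-up names and a `call_llm` stub in place of a real model client; this is not ARTEMIS's actual code):

```python
from dataclasses import dataclass

def call_llm(prompt: str) -> str:
    """Stand-in for whatever model client you use."""
    raise NotImplementedError

@dataclass
class Finding:
    task: str
    evidence: str

def worker(task: str) -> Finding:
    # Sub-agent: executes one narrow task and reports evidence.
    out = call_llm(f"Perform this pentest sub-task and report evidence:\n{task}")
    return Finding(task=task, evidence=out)

def triage(finding: Finding) -> bool:
    # Intermediate validation: a separate pass decides whether the claimed
    # finding is actually supported before it reaches the report.
    verdict = call_llm(
        "Does this evidence support the finding? Answer yes or no.\n"
        f"Finding: {finding.task}\nEvidence: {finding.evidence}"
    )
    return verdict.strip().lower().startswith("yes")

def supervisor(objective: str) -> list[Finding]:
    # Supervisor: plans sub-tasks and filters results, but never executes.
    plan = call_llm(f"Break this pentest objective into sub-tasks, one per line:\n{objective}")
    validated = []
    for task in filter(None, (t.strip() for t in plan.splitlines())):
        finding = worker(task)
        if triage(finding):
            validated.append(finding)
    return validated
```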

show 1 reply
JohnMakin last Tuesday at 9:41 PM

From WSJ article:

> The AI bot trounced all except one of the 10 professional network penetration testers the Stanford researchers had hired to poke and prod, but not actually break into, their engineering network.

Oh, wow!

> Artemis found bugs at lightning speed and it was cheap: It cost just under $60 an hour to run. Ragan says that human pen testers typically charge between $2,000 and $2,500 a day.

Wow, this is great!

> But Artemis wasn’t perfect. About 18% of its bug reports were false positives. It also completely missed an obvious bug that most of the human testers spotted in a webpage.

Oh, hm, did not trounce the professionals, but ok.

show 2 replies
nullcathedral last Tuesday at 10:15 PM

I work in this space. The productivity gains from LLMs are real, but not in the "replace humans" direction.

Where they shine is the interpretive grunt work: "help me figure out where the auth logic is in this obfuscated blob", "make sense of this minified JS", "what's this weird binary protocol doing?", "write me a Frida script to hook these methods and dump these keys". Things that used to mean staring at code for hours or writing throwaway tooling now take a fraction of the time. They're a straight-up playing-field leveler.
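As a concrete example of that last one, this is the kind of throwaway Frida harness an LLM can draft in seconds (the app, class, and method names here are made-up placeholders; it's just the shape of the task):

```python
import frida  # Frida's Python bindings

# The hooking logic itself is Frida JavaScript; the target class/method
# below are hypothetical.
JS = """
Java.perform(function () {
  var Crypto = Java.use("com.example.app.CryptoHelper");
  Crypto.getKey.implementation = function () {
    var key = this.getKey();
    send({ key: key });          // ship the value back to the Python side
    return key;
  };
});
"""

def on_message(message, data):
    print(message)

# Attach to the running app over USB, inject the script, print what it sends.
session = frida.get_usb_device().attach("com.example.app")
script = session.create_script(JS)
script.on("message", on_message)
script.load()
input("Hooked; press Enter to detach\n")
```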

Folks with the hacker's mindset but without the programming chops can punch above their weight and find more within the limited time of an engagement.

Sure, they make mistakes and need a lot of babysitting, but they're getting better. I expect more firms to adopt them as part of their routine.

show 2 replies
scandinavian yesterday at 11:50 AM

I don't read a lot of papers, but to me this one seems iffy in spots.

> A1 cost $291.47 ($18.21/hr, or $37,876/year at 40 hours/week). A2 cost $944.07 ($59/hr, $122,720/year). Cost contributors in decreasing order were the sub-agents, supervisor and triage module. *A1 achieved similar vulnerability counts at roughly a quarter the cost of A2*. Given the average U.S. penetration tester earns $125,034/year [Indeed], scaffolds like ARTEMIS are already competitive on cost-to-performance ratio.
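To be fair, the hourly-to-annual arithmetic in that quote does check out (rate x 40 hours x 52 weeks):

```python
# Sanity check of the annualized figures quoted above.
for name, hourly in [("A1", 18.21), ("A2", 59.00)]:
    print(name, hourly * 40 * 52)   # A1 -> 37876.8, A2 -> 122720.0
```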

The statement about similar vulnerability counts seems like a straight-up lie. A2 found 11 vulnerabilities with 9 of these being valid. A1 found 11 vulnerabilities with 6 being valid. Counting invalid vulnerabilities to say the cheaper agent is as good is a weird choice.

Also the scoring is suspect and seems to be tuned specifically to give the AI a boost, heavily relying on severity scores.

Also kinda funny that the AIs were slower than all the human participants.

falloutx last Tuesday at 10:27 PM

WSJ always writes in this clickbaity way, and it's constantly getting worse.

An exec is gonna read this and start salivating at the idea of replacing security teams.

show 2 replies
Sytten last Tuesday at 11:21 PM

Bootstrapped founder in that field. Fully autonomous is just not there. The winner for this "generation" will be human-in-the-loop / human augmentation, IMO. When the VC money dries up, there will be a pile of autonomous AI pentest companies left in it.

show 1 reply
zerodayai last Tuesday at 10:12 PM

I'm currently on the tail end of building out an agentic hacking framework; I wanted to learn best practices for building agents (I have an SDK with memory (short/medium/long-term), knowledge graph/RAG, and tools and plugins that make it easy to develop new agents the orchestrator can coordinate).

I also wanted to capture what's in my head from doing bug bounties (my hobby) and 15+ years in appsec/devsecops to get it "on paper". If anyone would like to kick the tires, take a look, or tell me it's garbage, feel free to email me (address in my profile).
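Roughly the shape of what I mean, as a toy illustration (names are made up for the example; none of this is the actual SDK):

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    short: list[str] = field(default_factory=list)   # current engagement context
    medium: list[str] = field(default_factory=list)  # findings on this target
    long: list[str] = field(default_factory=list)    # reusable techniques / past bounty notes

@dataclass
class Agent:
    name: str
    tools: list[str]

    def run(self, task: str, memory: Memory) -> str:
        # The real thing would call a model with the relevant memory tiers
        # and tool output; this is just a placeholder.
        return f"{self.name} handled: {task}"

class Orchestrator:
    def __init__(self, agents: list[Agent]):
        self.agents = {a.name: a for a in agents}
        self.memory = Memory()

    def dispatch(self, agent_name: str, task: str) -> str:
        result = self.agents[agent_name].run(task, self.memory)
        self.memory.medium.append(result)   # persist findings for later agents
        return result

orc = Orchestrator([Agent("recon", ["nmap", "httpx"]), Agent("web", ["burp", "sqlmap"])])
print(orc.dispatch("recon", "enumerate subdomains for example.com"))
```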

lillesvin last Tuesday at 10:36 PM

Do I read it right that ARTEMIS required a not-insignificant number of hints in order to identify the same vulnerabilities the human testers found? (P. 7 of the PDF.)

Zigurd yesterday at 1:01 AM

Pen testing, and cybersecurity in general, shares characteristics with some other fields in which AI performs well compared to humans: it requires mastery of a body of knowledge that's barely manageable by humans. Law, medicine, and other professions where we send people to graduate school to get good at unnatural mental tasks are similar.

show 1 reply
socketcluster yesterday at 2:10 AM

With this model, the 'security researcher' becomes a middleman between AI agents, tech companies, and hackers. We need a new term: 'cybersecurity broker.'

show 1 reply
protoculture yesterday at 5:00 AM

18 dollars an hour is quite steep considering LLMs are loss leaders.

I wouldn't be surprised if they end up near cost parity. Maybe a 20% difference.

ironbound last Tuesday at 11:50 PM

You can give an agent access to RevEng tools, spend $1k on API calls, and be no better off.

rboyd last Tuesday at 11:33 PM

so how much of a factor is it that safety guardrails may be keeping the current models from achieving higher scores in whatever red teaming benchmarks exist?

show 1 reply
rando77 last Tuesday at 9:45 PM

Sounds like they need another agent to detect false positives (I joke, I joke)

show 1 reply