It's way too early to make firm predictions here, but if you're not already in the field it's helpful to know there's been 20 years of effort at automating "pen-testing", and the specific subset of testing this project focused on (network pentesting --- as opposed to app pentesting, which targets specifically identified network applications) is already essentially fully automated.
I would expect over the medium term agent platforms to trounce un-augmented human testing teams in basically all the "routinized" pentesting tasks --- network, web, mobile, source code reviews. There are too many aspects of the work that are just perfect fits for agent loops.
So the stuff that agents would excel at is essentially just the "checklist" part of the job? Check A, B, C, possibly using tools X, Y, Z, possibly multi-step checks but everything still well-defined.
Whereas finding novel exploits would still be the domain of human experts?
It's not way too early, imo. This is an academic nerds' proof of concept for a school research project, not a "group of elite hackers get together and work out a world-class production-ready system".
Agent platforms have similar modes of failure, whether it's creative writing, coding, web design, hacking, or any other sort of project scaffolding. A lot of recent research has dealt with resolving the underlying gaps in architectures and training processes, and they've had great success.
I fully expect frontier labs to have generalized methodologizing capabilities within the first half of the year, and by the end of the year, the Pro/Max/Heavy variants of the chatbots will have those capabilities baked in fully. Instead of reaching for Codex or Artemis or Claude Code, you'll just ask the model to think through and plan your project, whatever the domain, and get professional-class results, as if an expert human were orchestrating the project.
All sorts of complex visual tool use --- PCB design, building plans, 3D modeling --- have similar process abstractions, and the decomposition and specialized task execution are very similar in principle to the generalized skills I mentioned. I think '26 is going to be exciting as hell.
The automated app pentest scanners find the bottom 10-20% of vulns; no real pentester would consider them great. Agents might get us to the 40-50% range. What they're really good at is finding "signals" that the human should investigate.