logoalt Hacker News

samuelknighttoday at 12:18 AM4 repliesview on HN

My startup builds agents for penetration testing, and this is the bet we have been making for over a year when models started getting good at coding. There was a huge jump in capability from Sonnet 4 to Sonnet 4.5. We are still internally testing Opus 4.5, which is the first version of Opus priced low enough to use in production. It's very clever and we are re-designing our benchmark systems because it's saturating the test cases.


Replies

vngzstoday at 2:27 AM

How do you manage to coax public production models into developing exploits or otherwise attacking systems? My experience has been extremely mixed, and I can't imagine it boding well for a pentesting tools startup to have end-users face responses like "I'm sorry, but I can't assist you in developing exploits."

show 1 reply
carsoontoday at 3:43 AM

Yeah this latest generation of models (Opus 4.5 GPT 5.1 and Gemini Pro 3) are the biggest breakthrough since gpt-4o in my mind.

Before it felt like they were good for very specific usecases and common frameworks (Python and nextjs) but still made tons of mistakes constantly.

Now they work with novel frameworks and are very good at correcting themselves using linting errors, debugging themselves by reading files and querying databases and these models are affordable enough for many different usecases.

dborehamtoday at 1:32 AM

I've had similar experience using LLMs for static analysis of code looking for security vulnerabilities, but I'm not sure it makes sense for me to found a start up around that "product". Reason being that the technology with the moat isn't mine -- it belongs to Anthropic. Actually it may not even belong to them, probably it belongs to whoever owns the training data they feed their models. Definitely not me though. Curious to hear your thoughts on that. Is the idea to just try for light speed and exit before the market figures this out?

show 2 replies
VladVladikofftoday at 12:43 AM

I have a hotel software startup and if you are interested in showing me how good your agents are you can look us up at rook like the chess piece, hotel dot com

show 1 reply