HolmesGPT maintainer here: our benchmarks [1] tell a very different story, as does anecdotal evidenc...

nyellin • yesterday at 8:38 PM • 0 replies • view on HN

HolmesGPT maintainer here: our benchmarks [1] tell a very different story, as does anecdotal evidence from our customers- including Fortune 500 using SRE agents in incredibly complex production environments.

We're actually struggling a bit with benchmark saturation right now. Opus does much better in the real world than Sonnet but it's hard to create sophisticated enough benchmarks to show that in the lab. When we run benchmarks with a small number of iterations Sonnet even wins sometimes.

[1] https://holmesgpt.dev/development/evaluations/history/

alt Hacker News