We need more rigorous benchmarks for SRE tasks, which is much easier said than done.
The only other benchmark I've come across is https://sreben.ch/ ... certainly there must be others by now?
We publish the benchmarks for HolmesGPT (CNCF sandbox project) at https://holmesgpt.dev/development/evaluations/