Original title: Benchmarking OpenTelemetry: Can AI trace your failed login?
HN Editorialized: OTelBench: AI struggles with simple SRE tasks (Opus 4.5 scores only 29%)
The task:
> Your task is: Add OTEL tracing to all microservices.
> Requirements:
> Instrumentation should match conventions and well-known good practices.
> Instrumentation must match the business domain of the microservices.
> Traces must be sent to the endpoint defined by a standard OTEL environment variable.
> Use the recent version of the OTEL SDK.
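For reference, a minimal sketch of what that setup typically looks like in Python, assuming the `opentelemetry-sdk` and OTLP exporter packages; the service name and span name are illustrative, not taken from the benchmark:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# The OTLP exporter reads OTEL_EXPORTER_OTLP_ENDPOINT from the environment
# on its own, which is presumably the "standard OTEL environment variable"
# the task refers to.
provider = TracerProvider(
    resource=Resource.create({"service.name": "login-service"})  # hypothetical name
)
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Span names should reflect the business domain, e.g. a login flow.
with tracer.start_as_current_span("user.login"):
    ...  # handle the login request
```

The per-service boilerplate is small; presumably the hard part the benchmark measures is doing this consistently across many services and naming spans to match the business domain.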
I really don't think anything involving multiple microservices can be called 'simple', even for humans. Perhaps it is for an expert who already knows the specific business domain.
Having done app support across many environments, um - yes, multiple microservices is usually pretty simple. Just look at the open file/network handles and go from there. It's absolutely maddening to watch these models flail when trying to do something as basic as "check if the port is open" or "check if the process is running... and don't kill firefox this time" (see the sketch after this comment).
These aren't challenging things to do for an experienced human at all, but they're such a huge pain point for these models! It's hard for me to wrap my head around how these models can write surprisingly excellent code yet fall down on these sorts of relatively simple troubleshooting paths.
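Both checks really are a few lines, e.g. in Python; the host, port, and process name below are placeholders:

```python
import socket
import subprocess

def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Attempt a TCP connect; success means something is listening."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def process_is_running(name: str) -> bool:
    """Ask pgrep for an exact process-name match (no killing anything)."""
    return subprocess.run(["pgrep", "-x", name],
                          capture_output=True).returncode == 0

print(port_is_open("127.0.0.1", 4317))   # e.g. a local OTLP collector port
print(process_is_running("firefox"))     # check, don't kill
```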
As someone whose job is support more than SWE, I agree with this.
I've had to work in systems where events didn't share correlation IDs; I had to go in and filter entries down to microsecond-wide windows to get a small enough set that I could trace what actually happened between a set of services.
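The technique described here is essentially a tight time-window filter over merged logs; a hypothetical sketch (the field name and window width are assumptions, not from the comment):

```python
from datetime import datetime, timedelta

def around(entries, anchor: datetime, width_us: int = 500):
    """Keep entries within +/- width_us microseconds of a known anchor event."""
    delta = timedelta(microseconds=width_us)
    return [e for e in entries
            if anchor - delta <= e["timestamp"] <= anchor + delta]

# entries: merged, timestamp-sorted logs from every service involved
# anchor:  the timestamp of the failed request being chased
```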
From what I've seen on the enterprise software side of the world, a lot of companies are particularly bad at SRE, and there isn't much standardization.