Hacker News

Benchmarking OpenTelemetry: Can AI trace your failed login?

137 points by stared today at 3:37 PM | 80 comments

Comments

dang today at 10:08 PM

Submitters: "Please use the original title, unless it is misleading or linkbait; don't editorialize." - https://news.ycombinator.com/newsguidelines.html

If you want to say what you think is important about an article, that's fine, but do it by adding a comment to the thread. Then your view will be on a level playing field with everyone else's: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&so...

(Submitted title was "OTelBench: AI struggles with simple SRE tasks (Opus 4.5 scores only 29%)")

the_duke today at 4:56 PM

This is very confusingly written.

From the post I expected that the tasks were about analysing traces, but all the tasks in the repository are about adding instrumentation to code!

Some of the instructions don't give any guidance on how to do it; others specify which libraries to use.

"Use standard OTEL patterns" ... that's about as useful as saying "go write some code". There are a lot of ways to do instrumentation....

I'd be very curious HOW exactly the models fail.

Are the test sets just incredibly specific about what output they expect, and you get a lot of failures because of tiny subtle mismatches? Or do they just get the instrumentation categorically wrong?

Also important: do the models have access to a web search tool to read the library docs? Otel libraries are often complicated to use... without reading latest docs or source code this would be quite tricky.

Some models have gotten better at adding dependencies, installing them and then reading the code from the respective directory where dependencies get stored, but many don't do well with this.

All in all, I'm very skeptical that this is useful as a benchmark as-is.

I'd be much more interested in tasks like:

Here are trace/log outputs, here is the source code; find and fix the bug.

raincole today at 4:47 PM

Original title: Benchmarking OpenTelemetry: Can AI trace your failed login?

HN Editorialized: OTelBench: AI struggles with simple SRE tasks (Opus 4.5 scores only 29%)

The task:

> Your task is: Add OTEL tracing to all microservices.

> Requirements:

> Instrumentation should match conventions and well-known good practices.

> Instrumentation must match the business domain of the microservices.

> Traces must be sent to the endpoint defined by a standard OTEL environment variable.

> Use the recent version of the OTEL SDK.

I really don't think anything involving multiple microservices can be called 'simple', even for humans. Perhaps it is for an expert who knows the specific business's domain.
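
For context, the "standard OTEL environment variable" in the third requirement is OTEL_EXPORTER_OTLP_ENDPOINT, which the OTLP exporters read automatically. A minimal per-service setup might look like the sketch below (Python SDK assumed; the service and span names are invented):

    from opentelemetry import trace
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

    # The exporter picks up OTEL_EXPORTER_OTLP_ENDPOINT from the environment,
    # so no endpoint is hard-coded here. "checkout" is a made-up service name.
    provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("place-order"):
        ...  # business logic for the traced operation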

whynotminot today at 4:33 PM

I would wager the main reason for this is the same reason it’s also hard to teach these skills to people: there’s not a lot of high quality training for distributed debugging of complex production issues. Competence comes from years of experience fighting fires.

Very few people start their careers as SREs, it’s generally something they migrate into after enjoying it and showing aptitude for it.

With that said, I wouldn't expect this wall to hold up for too long. There has been a lot of low-hanging fruit in teaching models how to code. When that is saturated, the frontier companies will likely turn their attention to honing training environments for SRE-style debugging.

asyncadventure today at 4:21 PM

This aligns with my experience trying to automate observability tasks - AI excels at individual coding patterns but struggles with the holistic understanding needed for distributed tracing. The 29% success rate actually seems optimistic considering how OpenTelemetry requires deep context about service boundaries and business logic, not just syntactic correctness.

dgxyz today at 4:17 PM

Our humans struggle with them too. It's the only domain where you actually need to know everything.

I wouldn't touch this with a pole if our MTTR depended on it being successful, though.

jedberg today at 8:59 PM

We've been experimenting with combining durable execution with debugging tasks, and it's working incredibly well! With the added context of actual execution data, with the developer defining which functions are important (instead of individual calls), it gives the LLM the data it needs.

I know there are AI SRE companies that have discovered the same -- that you can't just throw a bunch of data at a regular LLM and have it "do SRE things". It needs more structured context, and their value add is knowing what context and what structure is necessary.

nyellin today at 8:38 PM

HolmesGPT maintainer here: our benchmarks [1] tell a very different story, as does anecdotal evidence from our customers, including Fortune 500 companies using SRE agents in incredibly complex production environments.

We're actually struggling a bit with benchmark saturation right now. Opus does much better in the real world than Sonnet, but it's hard to create sophisticated enough benchmarks to show that in the lab. When we run benchmarks with a small number of iterations, Sonnet sometimes even wins.

[1] https://holmesgpt.dev/development/evaluations/history/

dirtytoken7 today at 8:15 PM

The 29% score tells us more about benchmark design than model capability IMO.

These benchmarks conflate two very different problems: (1) understanding what needs to be done, and (2) correctly implementing it in a specific library ecosystem.

A human SRE who's never touched OTel would also struggle initially - not because they can't reason about traces, but because the library APIs have quirks that take time to learn.

The more interesting question is whether giving the model access to relevant docs/examples during the task significantly changes the scores. If it does, that suggests the bottleneck is recall, not reasoning. If it doesn't, the reasoning gap is real.

FWIW I've found that models do much better on ops tasks when you can give them concrete examples of working instrumentation in the same codebase rather than asking them to generate from scratch.
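
To make that concrete: the kind of in-repo example that seems to help is a short, already-working span that encodes the local naming conventions. A sketch of what such a reference snippet could look like (Python SDK assumed; every name here is invented):

    from opentelemetry import trace

    tracer = trace.get_tracer("shop.payments")

    def process_payment(order_id: str, amount_cents: int) -> None:
        # One span per business operation, with domain attributes that the
        # rest of the codebase can copy as its naming convention.
        with tracer.start_as_current_span("payments.process_payment") as span:
            span.set_attribute("order.id", order_id)
            span.set_attribute("payment.amount_cents", amount_cents)
            ...  # existing business logic goes here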

0xferruccio today at 6:57 PM

To be fair, I remember spending almost two weeks implementing OTel at my startup; the infrastructure-as-code setup of getting collectors running in a Kubernetes cluster with Terraform was a nightmare two years ago.

I just kept running into issues; the docs were really poor and the configuration had endless options.

srijanshukla18 today at 6:09 PM

Humans can't do much of OTelBench either. Try finding even good documentation for it.

That's just misleading phrasing in this post.

I'm an SRE, and AI does NOT struggle with 'simple SRE tasks'. OTel instrumentation is by no measure a 'simple SRE task'.

jcims today at 4:12 PM

I've been building an 'SRE agent' with LangGraph for the past couple of weeks, and honestly I've been incredibly impressed with the ability of frontier models, when properly equipped with useful tools and context, to quickly diagnose issues and suggest reasonable remediation steps. Primary tooling for me is access to source code, the CI/CD environment, and the infrastructure control plane. Some cues in the context to establish basic conventions really help.

Even when it's not particularly effective, the additional information provided tends to be quite useful.

mellosouls today at 7:36 PM

Related discussion the other day:

The future of software engineering is SRE (257 points, 139 comments)

https://news.ycombinator.com/item?id=46759063

hakanderyal today at 5:27 PM

Anyone who has spent serious time with agents knows that you cannot expect out-of-the-box success without good context management, despite what the hype crowd claims.

Have the AI document the services first in a concise document. Then give it proper instructions about what you expect, along with the documentation it created.

Opus would pass that.

We are not there yet; the agents are not ready to replace the driver.

ripped_britches today at 5:48 PM

Maybe I haven’t dug in enough, but why is the second GET request a different trace?

Is it clicking a different result from same search?

It’s possible that the requirements here are not clear, given that the instructions don’t detail how to handle such a situation and it’s not obvious to me as a human.

winton today at 4:23 PM

So if I try to do it with Opus three or four times, I'll get it done? And probably in about 10 minutes? Awesome

smithclay today at 5:10 PM

We need more rigorous benchmarks for SRE tasks, which is much easier said than done.

The only other benchmark I've come across is https://sreben.ch/ ... certainly there must be others by now?

jp57 today at 8:11 PM

Which have longer lifecycles, LLM model versions, or trends in SRE practices?

esafak today at 6:29 PM

This is a good idea. It makes sense that they would struggle because there is not much training data.

yomismoaqui today at 4:54 PM

I'm a human with 20+ years of experience, and making OTEL work in Go was painful.

It made me remember when I was working in the J2EE ecosystem (shudder).

AnotherGoodName today at 4:26 PM

This is a little damning of the way Google does things honestly.

>When an app runs on a single machine, you can often trace an error by scrolling through a log file. But when it runs across 50 microservices, that single request gets scattered into a chaotic firehose of disconnected events.

Yep, this is about Google. It's painful for humans to debug, and it's also an extremely bespoke issue to deal with. No one else has quite the same level of clusterfuck, and there's going to be no training data for LLMs on this.

derfurth today at 5:27 PM

In my experience the approach matters a lot. I recently implemented OTel with Claude Code in a medium-sized ~200k LOC project:

- initially it wasn't working, with plenty of parent/child relationship problems like those described in the post

- so I designed a thin wrapper and used sealed classes for events instead of dynamic spans, plus some light documentation

It took me like a day to implement tracing on the existing codebase, and for new features it works out of the box using the documentation.

At the end of the day, leveraging typing + documentation dramatically constrains LLMs to do a better job
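
Roughly what that wrapper could look like, translated to Python (the comment doesn't say which language or libraries were actually used, so this only illustrates the pattern of typed events instead of free-form spans, not the real implementation; all names are invented):

    from contextlib import contextmanager
    from dataclasses import dataclass

    from opentelemetry import trace

    tracer = trace.get_tracer("app")

    @dataclass(frozen=True)
    class OrderPlaced:
        order_id: str

    @dataclass(frozen=True)
    class StockReserved:
        order_id: str
        sku: str

    # The closed set of events that generated code is allowed to emit.
    TraceEvent = OrderPlaced | StockReserved

    @contextmanager
    def traced(event: TraceEvent):
        # One place decides span names and attributes, so an LLM adding
        # instrumentation only picks which event to record, never how to
        # name or structure the span.
        with tracer.start_as_current_span(type(event).__name__) as span:
            for key, value in vars(event).items():
                span.set_attribute(f"app.{key}", value)
            yield span

Usage is then just "with traced(OrderPlaced(order_id=...)): ...", and adding a new event type is a small, type-checked change.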

0xbadcafebee today at 6:50 PM

Is it just me or is that prompt... not ideal? There are no concrete, simple goals, no mention of testing, no loop. No description of the problem space or what success should look like. One-shot might work for this with frontier models, but they often need more to succeed.

Saying "any SRE should be able to do this" is already problematic, because regardless of title, there are smarter people and dumber people. You're taking a gamble giving a human SRE this prompt. Whether it's AI or human, give it more context and instruction, or failure is likely. (And more importantly: use a loop so it can fix itself!)

(also: SRE is too generic... there are a dozen kinds of SRE)

elAhmo today at 7:11 PM

Key is "for now".

NitpickLawyer today at 5:04 PM

I'm always interested in new benchmarks, so this is cool. I only had a brief look at [1] and [2]; a few quick things I noticed:

For [1]: instruction.md is very brief, quite vague and "assumes" a lot of things.

- Your task is: Add OTEL tracing to all microservices. Add OTEL logging to all microservices. (this is good)

- 6.I want to know if the microservice has OTEL instrumentation and where the data is being sent. (??? i have no idea what this means)

- 9.Use the recent version of the OTEL SDK. (yeah, this won't work unless you also use an MCP like context7 or provide local docs)

What's weird here is that instruct.md has 0 content regarding conventions, specifically how to name things. Yet in tests_outputs you have this "expected_patterns = ["order", "stock", "gateway"]" and you assert on it. I guess that makes some sense, but being specific in the task.md is a must. Otherwise you're benching assumptions, and those don't even work with meatbags :)

For [2]: instruction.md is more detailed, but has some weird issues:

- "You should only be very minimal and instrument only the critical calls like request handlers without adding spans for business calls \n The goal is to get business kind of transaction" (??? this is confusing, even skipping over the weird grammar there)

- "Draw ascii trace diagram into /workdir/traces.txt" (????)

- "When modifying Python files, use Python itself to write files or use sed for targeted changes" (? why are you giving it harness-specific instructions in your instruct.md? this is so dependent on the agentic loop used, that it makes no sense here.

- "Success Criteria: Demonstrate proper distributed tracing \n Include essential operations without over-instrumenting (keep it focused) \n Link operations correctly \n Analyze the code to determine which operations are essential to trace and how they relate to each other. (i mean ... yes and no. these are not success criteria IMO. It's like saying "do good on task not do bad". This could definitely be improved.)

----

Also, I noticed that every folder has a summary_claude... that looks like a Claude-written summary of a run. I hope that's not what's actually used to compute the benchmark scores; if it is, you're adding another layer of uncertainty to checking the results...

The idea is nice, but tbf some of the tests seem contrived, your instructions are not that clear, you expect static naming values while not providing any instructions about naming conventions, and so on. It feels like a lot of this was "rushed"? I peeked a bit at the commit history and saw some mentions of vibe-coding a viewer for this. I hope that's the only thing that was vibe-coded :)

[1] - https://github.com/QuesmaOrg/otel-bench/tree/main/datasets/o...

[2] - https://github.com/QuesmaOrg/otel-bench/blob/main/datasets/o...

lenerdenator today at 7:55 PM

This just reinforces the notion that if you don't have someone who at least roughly knows what they're doing giving a very detailed prompt and checking the output, you're wasting tokens.

Plan mode is your friend.

benatkin today at 7:21 PM

> AI SRE in 2026 is what DevOps Anomaly Detection was in 2015 — bold claims backed by huge marketing budgets, but lacking independent verification. There are stories of SaaS vendors abruptly killing the observability stack. Our results mirror ClickHouse’s findings: while LLMs can assist, they lack the capabilities of a skilled SRE.

The key is that LLMs can assist. It would be nice if they went further into this and showed how much more quickly a human who wrote a complex prompt, or went back and forth with a coding agent, could do the tasks compared to an unassisted human. I'm confident it's at a level that already has profound implications for SRE. And the current level of getting it right from a simple prompt is still impressive.

heliumtera today at 5:17 PM

Standard SRE tasks are bad benchmarks.

First of all, familiarity with the OpenTelemetry APIs is not knowledge; they are arbitrary constructs.

We are implying that conforming to a standard is the only way, the right way. I would challenge that.

Assuming models were good at these tasks, we could only conclude that the tasks were trivial AND sufficiently documented. And if they were good at this type of task (they can be trained to be good cheaply; we know that from similar acquired capabilities), making a benchmark out of it would be less useful.

But I am sure nobody really cares, and the author just had to do a bit of SEO regardless of reality.

linuxftw today at 5:02 PM

The prompts for this are pretty sparse. This could 100% be accomplished with better prompting. Even with the current prompts, it's likely I could complete the task with a follow-up request specifying what it did correctly and incorrectly. In fact, this could probably be entirely automated with multiple agents checking each other.

vachina today at 5:38 PM

LLM is AI now, wow.

Also, an LLM is a very advanced autocomplete algorithm. And autocomplete isn't designed to write for you; you have to write first.

whalesalad today at 4:09 PM

If everyone else is the problem... maybe you are the problem. To me this says more about OTel than AI.

rapsacnz today at 6:28 PM

I'd argue that this is just another reason not to use microservices.

another_twist today at 4:21 PM

[flagged]