Can you help me understand where you are coming from? Is it that you think the benchmark is flawed or overly harsh? Or that you interpret the tone as blaming AI for failing a task that is inherently tricky or poorly specified?
My takeaway was more "maybe AI coding assistants today aren’t yet good at this specific, realistic engineering task"....
Where I work, we are looking at a lot of our documentation and implementations in the places where AI has a hard time working with them.
This almost always correlates with customers having similar issues in getting things working.
This has led us to rewrite a lot of documentation to be more consistent and clear. In addition, we set out a series of examples ranging from simple to complex. The result shows up as fewer tickets later, and more complex implementations being set up by customers without the need for support.
In my experience many OTEL libraries are awful to use, and most of the "official" ones are the worst offenders, as they are largely codegened. That typically makes them feel clunky to use, and they exhibit code patterns that are non-native to the host language, which would be one explanation for why AI systems struggle with the benchmark.
I think you would see similar results if you tasked an AI with, e.g., writing gRPC/Protobuf systems using only the built-in/official protobuf codegen for each language.
Where I think the benchmark is quite fair is in the solutions. It looks like for each of the languages (at least the ones I'm familiar with), the "better" options were chosen, e.g. using `tracing-opentelemetry` rather than `opentelemetry-sdk` directly in Rust.
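For concreteness, here's a minimal sketch of what that "better" Rust option looks like: bridging `tracing` spans into OpenTelemetry via the `tracing-opentelemetry` layer instead of driving `opentelemetry-sdk` by hand. These crates' APIs churn between releases, so treat the exact module paths and builder names as approximate, and the stdout exporter is just a placeholder to keep it self-contained:

```rust
use opentelemetry::trace::TracerProvider as _; // trait that provides .tracer()
use opentelemetry_sdk::trace::TracerProvider;
use tracing_subscriber::{layer::SubscriberExt, Registry};

fn main() {
    // Build an SDK tracer provider. A real setup would use an OTLP exporter;
    // the stdout exporter only keeps the sketch runnable on its own.
    let provider = TracerProvider::builder()
        .with_simple_exporter(opentelemetry_stdout::SpanExporter::default())
        .build();
    let tracer = provider.tracer("example-app");

    // The tracing-opentelemetry layer converts `tracing` spans into OTEL spans,
    // so application code never touches the SDK types directly.
    let otel_layer = tracing_opentelemetry::layer().with_tracer(tracer);
    let subscriber = Registry::default().with(otel_layer);
    tracing::subscriber::set_global_default(subscriber).expect("set subscriber");

    handle_request(7);
}

// Instrumentation stays idiomatic: an attribute macro, not manual span plumbing.
#[tracing::instrument]
fn handle_request(request_id: u64) {
    tracing::info!("handling request");
}
```

That's the appeal of the `tracing-opentelemetry` route: the OTEL-specific clunkiness is confined to the setup, and everything else is plain `tracing`.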
However, the one-shot nature of the benchmark also isn't that reflective of actual utility. In my experience, if you have the initial framework setup done in your repo plus a handful of examples, these tools do a great job of applying OTEL tracing to the majority of your project.
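As a rough illustration of that workflow (function names here are hypothetical): once the subscriber wiring and a couple of examples like the one above live in the repo, what's left for the assistant to replicate across the codebase is mostly this kind of boilerplate:

```rust
use tracing::instrument;

// Hypothetical business function, names are illustrative only. With the OTEL
// layer already installed elsewhere, instrumenting most of a project means
// repeating this pattern: annotate, add a few events, move on.
#[instrument]
fn process_order(order_id: u64) -> bool {
    tracing::info!("processing order");
    order_id % 2 == 0
}

fn main() {
    // Stand-in fmt subscriber so this snippet runs on its own; in the real
    // repo the global subscriber would be the OpenTelemetry setup above.
    tracing_subscriber::fmt::init();
    process_order(7);
}
```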