>establish benchmarks that make sense and are reliable
How are current LLM coding benchmarks not reliable?
They're manipulated.