IMO the benchmark should use a third party to run the LLM anyway. Otherwise the evaluated company could notice it's receiving the same requests daily and figure out it's being benchmarked.
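And spotting that wouldn't take much. A toy sketch of what detection could look like on the provider side (purely illustrative, assuming the provider logs raw request bodies):

    import hashlib
    from collections import Counter

    seen = Counter()  # request fingerprint -> times observed

    def looks_like_benchmark(request_body: str, threshold: int = 3) -> bool:
        # Fingerprint the raw prompt; identical daily benchmark
        # runs produce identical hashes.
        fp = hashlib.sha256(request_body.encode("utf-8")).hexdigest()
        seen[fp] += 1
        # The same prompt arriving day after day is a strong signal.
        return seen[fp] >= threshold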
But that removes a component that's critical to the test. As users/benchmark consumers, we care that the service as actually provided by Anthropic/OpenAI/Google is consistent over time given the same model/prompt/context.
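In other words, the thing under test is drift in the production endpoint itself. A minimal sketch of that kind of longitudinal check, stdlib only; fetching the response from the vendor's API is deliberately left out, since the whole point is that it must come from the real service:

    import datetime
    import difflib
    import json
    import pathlib

    LOG = pathlib.Path("consistency_log.jsonl")

    def record_and_compare(prompt_id: str, response_text: str) -> float:
        """Log today's response; return similarity vs. the first recorded run."""
        baseline = None
        if LOG.exists():
            for line in LOG.read_text().splitlines():
                rec = json.loads(line)
                if rec["prompt_id"] == prompt_id:
                    baseline = rec["response"]  # first run is the baseline
                    break
        with LOG.open("a") as f:
            f.write(json.dumps({
                "date": datetime.date.today().isoformat(),
                "prompt_id": prompt_id,
                "response": response_text,
            }) + "\n")
        if baseline is None:
            return 1.0  # first observation, nothing to compare against
        # Crude drift measure: textual similarity to the baseline response.
        return difflib.SequenceMatcher(None, baseline, response_text).ratio()

A similarity score that slides downward over days would flag silent model or serving changes, which is exactly what routing the benchmark through a third-party host would fail to catch.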
With the insane valuations and actual revenue at stake, benchmarkers should assume they're evaluating in an adversarial environment. Whether through intentional gaming, training to the test, or simply prioritizing work that's likely to make results look better, targeting of benchmarks will almost certainly happen.
We already know large graphics card manufacturers tuned their drivers to recognize specific gaming benchmarks. When that was busted, they moved on to detecting benchmark-like behavior. And the money at stake in consumer gaming was tiny compared to current AI valuations. The cat-and-mouse cycle of measure vs. counter-measure won't stop, and anticipating it should be a standard part of developing and administering benchmark services.
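On the benchmarker's side, the obvious first counter-measure is to stop sending byte-identical prompts. Something like this (the wrapper templates here are made up for illustration):

    import random

    # Vary the surface form of each benchmark prompt so runs don't
    # fingerprint identically, while keeping the task itself fixed.
    WRAPPERS = [
        "{q}",
        "Please answer the following: {q}",
        "Question: {q}\nAnswer:",
        "I'd like your help with this: {q}",
    ]

    def jitter(question: str, rng: random.Random) -> str:
        return rng.choice(WRAPPERS).format(q=question)

Of course, rewording can itself shift model behavior, so scoring then has to be robust to phrasing, and the cycle continues.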
Beyond hardening against adversarial gaming, benchmarkers bear a longer-term burden too. Per Goodhart's Law, it's inevitable that good benchmarks will become targets. The challenge is that the industry will increasingly target performing well on leading benchmarks, both because it drives revenue and because it's a far clearer goal than trying to glean what helps average users most from imprecise surveys and fuzzy metrics. To the extent benchmarks become a proxy for reality, they'll have to continuously re-calibrate their workloads to accurately reflect reality as users' needs evolve.