Not an expert in LLM benchmarks, but I generally I think of benchmarks as being good particularly for measuring usefulness for certain usecases. Even if measuring LLMs is not as straightforward as, say, read/write speeds when comparing different SSDs, if a certain model's responses are consistently measured as being higher quality / more useful, surely that means something, right?