I have seen many LLM devs encounter this at some point. Good to see that you are not only pointing out the inconsistency but also actively advocating for a common benchmark.