Grok is at 17%? And that's the lowest, with most models at 80%+?
Meanwhile, hallucination is probably closer to 100% depending on the question. This benchmark makes no sense.
No one serious uses Grok.