Still a huge hallucination rate, unfortunately, at 86%. For comparison, Opus sits at 36%.
Source: https://artificialanalysis.ai/models?omniscience=omniscience...
There's something off with this because Haiku should not be that good.
This suggests the behavior is intentional. They know the person asking the question probably doesn't fully understand the problem (or why would they be asking?), so they'd rather ship a confident response, regardless of outcomes. The point is to sell the perception of the technology's competence, not its actual capabilities, to a bunch of people who have no clue what they're talking about.
LLMs will ruin your product. Have fun trusting a billionaire's thinking machine they swear can replace your employees if you just pay them 75% of your labor budget.
Grok is at 17%? And that's the lowest; most models are 80%+?
Meanwhile, real-world hallucination is probably closer to 100%, depending on the question. This benchmark makes no sense.
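For what it's worth, spreads like 17% vs 86% are at least arithmetically possible if the score only counts questions a model actually attempts. The toy sketch below is my own assumption about how such a metric could work, not Artificial Analysis's published methodology, and `hallucination_rate` is a hypothetical helper: two models with identical knowledge get very different rates purely from how often they abstain.

```python
# Toy sketch of one plausible scoring scheme (an assumption, NOT Artificial
# Analysis's published method): hallucination rate counts only wrong answers
# among questions the model actually attempted, so answering "I don't know"
# lowers the rate without the model knowing anything more.

def hallucination_rate(correct: int, wrong: int, abstained: int) -> float:
    """Wrong answers as a share of attempted answers; abstentions excluded."""
    attempted = correct + wrong
    return wrong / attempted if attempted else 0.0

# Two hypothetical models that each know 40 of 100 answers:
guesser = hallucination_rate(correct=40, wrong=60, abstained=0)   # 0.60
hedger = hallucination_rate(correct=40, wrong=10, abstained=50)   # 0.20
print(f"always guesses: {guesser:.0%}, often abstains: {hedger:.0%}")
```

Under that reading, the number tracks refusal behavior as much as actual knowledge, which would be consistent with the skepticism above.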