It has a SimpleQA score of 69%. SimpleQA is a benchmark that tests knowledge of extremely niche facts, so 69% is ridiculously high (Gemini 2.5 *Pro* scored 55%) and suggests either training on the test set or some cracked way to pack a ton of parametric knowledge into a Flash model.
I'm speculating, but Google might have figured out some training magic trick to balance information storage against model capacity. That, or this Flash model has a huge number of parameters or something.
Or could it be using tool calls during reasoning (e.g. a Google search)?
This will be fantastic for voice. I presume Apple will use it.
>or some sort of cracked way to pack a ton of parametric knowledge into a Flash Model.
More experts with a lower percentage of active ones -> more sparsity.
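To make the sparsity point concrete, here's a rough sketch (with made-up, hypothetical parameter counts, not actual Gemini figures) of why scaling expert count while keeping top-k routing fixed grows total knowledge capacity without growing per-token compute:

```python
def moe_params(num_experts: int, top_k: int,
               params_per_expert: float, shared_params: float):
    """Return (total, active) parameter counts for a simple MoE model.

    total:  all parameters the model can store knowledge in.
    active: parameters actually used per token (shared + top_k experts).
    """
    total = shared_params + num_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active

# Baseline config: 16 experts, 4 routed per token (all numbers illustrative).
t1, a1 = moe_params(num_experts=16, top_k=4,
                    params_per_expert=1e9, shared_params=2e9)
# Sparser config: 8x the experts, same 4 active per token.
t2, a2 = moe_params(num_experts=128, top_k=4,
                    params_per_expert=1e9, shared_params=2e9)

print(t1 / 1e9, a1 / 1e9)  # 18.0 total, 6.0 active (billions)
print(t2 / 1e9, a2 / 1e9)  # 130.0 total, 6.0 active (billions)
```

Same 6B active parameters per token either way, but the sparser config has ~7x the total capacity to memorize niche facts, which would fit a Flash-tier model punching above its weight on SimpleQA.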
Also
https://artificialanalysis.ai/evaluations/omniscience
Prepare to be amazed