The will not measure up. Notice they're comparing it to Gemma, Google's open weight model, not to Gemini, Sonnet, or GPT. That's fine - this is a tiny model.
If you want something closer to the frontier models, Qwen3.6-Plus (not open) is doing quite well[1] (I've not tested it extensively personally):
on the bright side also worth to keep in mind those tiny models are better than GPT 4.0, 4.1 GPT4o that we used to enjoy less than 2 years ago [1]
[1] https://artificialanalysis.ai/?models=gpt-5-4%2Cgpt-oss-120b...