You make a compelling argument, but thankfully I have data to back up my anecdotal experience.
This comparison shows them neck and neck https://benchlm.ai/compare/claude-sonnet-4-5-vs-gemma-4-31b
As does this one: https://llm-stats.com/models/compare/claude-sonnet-4-6-vs-ge...
And the pelican benchmark even shows them pretty close https://simonwillison.net/2026/Apr/2/gemma-4/ https://simonwillison.net/2025/Sep/29/claude-sonnet-4-5/
Also, this isn't a fringe statement; most people who have done an evaluation agree with me.
If you look at the detailed numbers in the benchmarks you shared, Sonnet 4.5 crushes Gemma 4. Somehow the first link doesn't run Sonnet on the multimodal benchmark, which is why the top scores look close; Sonnet beats Gemma on every benchmark they actually ran. The arena results in the second link show it handily beating Gemma 4 as well, not close.
I think one area I find hard to get around is context length. Everything self-hosted is so limited in context length that it's marginal to use. Additionally, I think the tools (like Claude Code) are clearly in the training mix for Anthropic's models, so they seem to get a boost over other models dropped into that environment. That being said, open source and local inference are -really- good and only going to get better. There is no doubt that the current frontier business model is not sustainable.