You obviously know what you’re looking for better than me, but personally I’d want to see a narrative that made sense before accepting that a smaller model somehow just performs better, even if the benchmarks say so. There may be such an explanation, but it feels very dicey without one.
Volume and statistical significance? I'm not sure what kind of narrative I would trust beyond the actual data.
That's the hard part of using LLMs, and trusting intuition over measurement is a mistake I think many people make. The only way to really know is to have a repeatable, consistent framework for testing your hypothesis (or in my case, for having my hypothesis proved wrong).
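To make "volume and statistical significance" concrete, here is a rough sketch of one way to check it: run both models on the same benchmark items and apply an exact McNemar test to the items where they disagree. The function name and the per-item results below are made up for illustration, not taken from any real benchmark.

```python
from math import comb


def mcnemar_exact(small_correct, large_correct):
    """Two-sided exact McNemar test on paired per-item pass/fail results.

    Both arguments are lists of booleans, one entry per benchmark item,
    in the same item order. Returns (small_only, large_only, p_value).
    """
    if len(small_correct) != len(large_correct):
        raise ValueError("both models must be scored on the same items")

    # Only items where the two models disagree carry information.
    small_only = sum(1 for s, l in zip(small_correct, large_correct) if s and not l)
    large_only = sum(1 for s, l in zip(small_correct, large_correct) if l and not s)
    n = small_only + large_only
    if n == 0:
        return small_only, large_only, 1.0

    # Under the null hypothesis each disagreement is a fair coin flip,
    # so the smaller count follows Binomial(n, 0.5).
    k = min(small_only, large_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return small_only, large_only, min(1.0, 2 * tail)


# Hypothetical results on 20 items: the small model wins 8 disagreements,
# the large model wins 2, and both get the remaining 10 right.
small = [True] * 8 + [False] * 2 + [True] * 10
large = [False] * 8 + [True] * 2 + [True] * 10
small_only, large_only, p_value = mcnemar_exact(small, large)
print(small_only, large_only, round(p_value, 3))
```

With these made-up numbers the small model is ahead 8 to 2 on the items where they disagree, yet the p-value is about 0.11, so 20 items is not enough volume to say the gap is real.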
You can't get to 100% confidence with LLMs.