There is likely a theoretical limit to how much intelligence you can pack into a model of a given size, especially when that capacity is stretched over a large input context.
Our evals are pretty complex, so we only recently started testing ~30B-class models, which are now becoming quite smart (on par with the frontier from a year ago). Mistral is far behind, but I'm rooting for them.
Data at https://gertlabs.com/rankings