Thinking that on-prem models will be a halfway decent alternative to what can be served out of a data center is a fool's take, and one that is more common than it should be on here.
The recent MiMo-V2.5-Pro-UltraSpeed can be served from 8 GPUs, which is certainly within the reach of sophisticated on-prem setups. https://mimo.xiaomi.com/blog/mimo-tilert-1000tps
If we're defining on-prem as fitting in a rack, then every frontier model can be hosted on-prem.
Now this might not be the most cost-effective option (and may require a bit of extra power), but you only need a datacenter for training or cost optimization.
The point is not to be as good as the multi-trillion-parameter model you can host across 72 GPUs (or whatever).
I'm running a 248B model on a paltry amount of hardware and getting plenty of good use out of it.
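For a rough sense of scale, the memory needed for the weights alone is about parameter count times bits per weight, divided by 8. A back-of-envelope sketch, assuming a dense 248B-parameter model and ignoring KV cache, activations, and runtime overhead; the bit widths below are illustrative assumptions, not a statement about any particular setup:

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Illustrative precisions only (assumed): 16-bit, 8-bit, and 4-bit weights.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_gib(248, bits):.0f} GiB")
```

Halving the bits per weight halves the footprint, which is why quantization is usually what decides how much hardware a model of this size needs.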
Sure, the most demanding tasks will demand the best models (and always will). There are still less demanding tasks for other models.
I think some people are fooling themselves that coding, of all tasks, is always going to require the biggest models ever. Again, maybe some coding tasks will, but the majority of business CRUD apps probably don't. Same goes for virtually any other type of task. The biggest models are really only useful for the most complex tasks.