Open Source isn't even within 50% of what the SOTA models are. Benchmarks are toys, real world use is vastly different, and that's where they seriously lag.
Why should anyone waste time on poorer results? I'd rather pay my $200/mo because my time matters. I'm not a poor college student anymore, and I need more return on my time.
I'm not shitting on open weights here - I want open source to win. I just don't see how that's possible.
It's like Photoshop vs. GIMP. Not only is GIMP's UX awful, but it didn't even offer (maybe still doesn't?) full bit-depth support. For a hacker with free time, that's fine. But if my primary job function is to transform graphics in exchange for money, I'm paying for the better tool. GIMP is entirely a no-go in a professional setting.
Or it's like Google Docs / Microsoft Office vs. LibreOffice. LibreOffice is still pretty trash compared to the big tools. It's not just that Google and Microsoft have more money, but their products are involved in larger scale feedback loops that refine the product much more quickly.
But with weights it's even worse than bad UX. These open weights models just aren't as smart. They're not getting RLHF'd on real world data. The developers of these open weights models can game benchmarks, but the actual intelligence for real world problems is lacking. And that's unfortunately the part that actually matters.
Again, to be clear: I hate this. I want open. I just don't see how it will ever be able to catch up to full-featured products.
I think there will come a point when open source models are "good enough" for many tasks (they probably already are for some tasks; or at least, some small number of people seem happy with them). But, as you suggest, it will likely always (for the foreseeable future, at least) be the case that closed SOTA models are significantly ahead of open models, and any task that can still benefit from a smarter model (which will probably always remain some large subset of tasks) will be better done on a closed model.
The trick is going to be recognizing tasks which have some ceiling on what they need and which will therefore eventually be doable by open models, and those which can always be done better if you add a bit more intelligence.
> Benchmarks are toys, real world use is vastly different... Why should anyone waste time on poorer results? I'd rather pay my $200/mo because my time matters.
This kind of rhetoric is not helpful. If you want to make a point, then make one, but this adds nothing to the conversation. Maybe open source models don't work for you. They work very well for me.
> Open Source isn't even within 50% of what the SOTA models are.
The gap has been shrinking with each release, and the SOTA has already run into diminishing returns for each extra unit of data+computation it uses.
Do you really want to bet that the gap will not eventually be a hair's breadth?
There's going to be a day when we look back at $200/mo price tags and say "wow that was cheap".
The breakeven at this price is 6 minutes of productivity per work day for an engineer making $200k.
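The six-minute figure checks out with back-of-the-envelope arithmetic. A quick sketch (the 2,000 hours/year and 250 work days/year are standard assumptions, not figures from the post):

```python
# Breakeven sketch for a $200/mo subscription vs. a $200k/year engineer.
# Assumes ~2,000 working hours and ~250 work days per year (illustrative).
annual_salary = 200_000                        # USD/year
hourly_rate = annual_salary / 2_000            # ~$100/hour
annual_subscription = 200 * 12                 # $2,400/year

hours_to_break_even = annual_subscription / hourly_rate   # 24 hours/year
minutes_per_day = hours_to_break_even * 60 / 250          # ~5.76 min/day
print(round(minutes_per_day, 2))               # prints 5.76
```

In other words, if the tool saves roughly six minutes per work day, it pays for itself.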
> Open Source isn't even within 50% of what the SOTA models are.
When was the last time you used any of them? Because a lot of people are actively using them for 9-5 work today; I count myself in that group. That opinion feels outdated, like it was formed a year or more ago and held onto, or based on highly quantized versions and/or small non-Thinking models.
Do you really think Qwen3.6, for a specific example, is "50%" as good as Opus4.7? Opus4.7 is clearly and objectively better, no debate on that, but the gap isn't anywhere near that wide. I'd call even "20%" hyperbole; the true difference is difficult to measure exactly, but sub-10% for their top-tier Thinking models is likely.
> Benchmarks are toys, real world use is vastly different, and that's where they seriously lag.
I'm not disagreeing per se, but if you think the benchmarks are flawed and "my real world usage" is more reflective of model capabilities, why not write some benchmarks of your own?
You stand to make a lot of money and gain a lot of clout in the industry if you've figured out a better way to measure model capability, maybe the frontier labs would hire you.
> Why should anyone waste time on poorer results?
Because in almost no real-world project is "programming time" the limiting factor?
IMO it's a different and new market. We're engineers, and we're rich. It's not going to be good enough for us. But the much larger market by far is all the people who used to HAVE to work with engineers. They now have optionality; the pendulum is going to swing.
Also, this space will be (and perhaps already is, for some of us) an arms race. Sure, you can go local, but hosted will always be able to offer more, and if you want to be competitive, you'll need to be using the most capable models.
People pirate Photoshop and Office if they don't want to pay for them, making them as "free" as GIMP. If there's a free option, people will use it. Never underestimate the cheapskates.
If sharing all of your code with the closed providers is OK then it works. If that is a blocker, open weights becomes much more compelling...
What will you do when they stop burning cash and the $200 plan becomes $2000?
> Open Source isn't even within 50% of what the SOTA models are
Who said so? GLM 5.1 is 90% Opus, at least. Some people quite happy with Kimi 2.6 too. I did not try Deepseek 4 yet but also hearing it is as good as Opus. You might be confusing open source models with local models. It is not easy to run a 1.6T model locally, but they are not 50% of SOTA models.
I think the problem is that we're all waiting for the patented Silicon Valley Rug Pull and ensuing enshittification, where there are a dozen tiers of products, you need 4 of them, and they now cost $2000/month. I want to hedge against that.
Unless you are getting outside of your comfort zone and taking a month off from your $200 subscription every other month, I can't see how you can make the universal claim that the open weights models are all 50% as good. Just today, DeepSeek released a new model, so nobody knows how that will compare; a week ago it was Gemma 4, etc. I'm okay with you making a comparison, but state the model, and the timeframe in which it was tested, that you're basing your conclusions on.