> I think there is a moderately large market for models like this that aren’t quite SOTA level but can be served up much cheaper.
There isn't, pretty much everyone wants the best of the best.
No. Right now I'm upset that Google has removed (or at least is in the process of removing) the Gemini 2.0 flash model. We use it for some pretty basic functionality because it's cheap and fast and honestly good enough for what we use it for in that part of our app. We're being forced to "upgrade" to models that are at least 2.5 times as expensive, are slower and, while I'm sure they're better for complex tasks, don't do measurably better than 2.0 flash for what we need. Yay. We've stuck with the GCP/Gemini ecosystem up until now, but this is kind of forcing us to consider other LLM providers.
> There isn't, pretty much everyone wants the best of the best.
For direct user interaction or coding problems, perhaps. But as API calls get cheaper, it becomes more realistic to use them for completely automated workflows against data-sets, or as sub-agents called from expensive SOTA models.
For example, in Claude, using Opus as an orchestrator to call Sonnet sub-agents is a popular usage "hack." That only gets more powerful as the Sonnet-equivalent model gets cheaper. Now you can spawn entire teams of small, specialized sub-agents with small context windows and limited scope.
Ever hit your daily limit on Claude Code and seen how expensive it is to pay per token?
Maybe there isn't, but as understanding grows, people will realize that having an orchestration agent delegate simple work to lesser agents matters not only for cost savings but also for preserving context window space.
For coding I want the best. Both I and $work do lots of things besides coding where smaller models like qwen3.5-27b work great, at much lower cost.
That isn't true. In a Codex or Claude Code instance, sure... but those are not the main users of APIs. If you are using LLMs in a service for customers, costs matter.
The market for API tokens is bigger than people like you and me (who also want the best) using them for code.
There are a lot of data science problems that benefit from running the dataset through an LLM, which becomes bottlenecked on per-token costs. For these you take a sample subset and run it against multiple providers and then do a cost versus accuracy tradeoff.
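To make that tradeoff concrete, here is a rough Python sketch of the comparison step. Everything in it is an assumption for illustration: the model names, token counts, prices, and accuracy numbers are placeholders, not real benchmarks or real price lists. It just picks the cheapest model that clears an accuracy bar on the sample, then scales the sample cost up to the full dataset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SampleRun:
    """Result of running one model over the same labeled sample (hypothetical fields)."""
    model: str
    correct: int             # sample rows the model got right
    total: int               # sample size
    input_tokens: int        # tokens sent for the whole sample
    output_tokens: int       # tokens generated for the whole sample
    usd_per_m_input: float   # placeholder price per 1M input tokens
    usd_per_m_output: float  # placeholder price per 1M output tokens

    @property
    def accuracy(self) -> float:
        return self.correct / self.total

    @property
    def cost_usd(self) -> float:
        # Cost of the sample run only, from token counts and per-million prices.
        return (self.input_tokens * self.usd_per_m_input
                + self.output_tokens * self.usd_per_m_output) / 1_000_000


def projected_cost(run: SampleRun, dataset_size: int) -> float:
    """Scale the sample cost linearly to the full dataset."""
    return run.cost_usd * dataset_size / run.total


def cheapest_meeting(runs: list[SampleRun], min_accuracy: float) -> Optional[SampleRun]:
    """Cheapest run whose sample accuracy is at least min_accuracy, or None."""
    good_enough = [r for r in runs if r.accuracy >= min_accuracy]
    return min(good_enough, key=lambda r: r.cost_usd, default=None)


# Made-up numbers, purely to show the shape of the comparison.
runs = [
    SampleRun("model_a", 470, 500, 400_000, 50_000, 3.00, 15.00),
    SampleRun("model_b", 455, 500, 400_000, 50_000, 0.50, 2.00),
    SampleRun("model_c", 410, 500, 400_000, 50_000, 0.10, 0.40),
]

pick = cheapest_meeting(runs, min_accuracy=0.90)
if pick is not None:
    print(pick.model, round(pick.accuracy, 3), round(projected_cost(pick, 1_000_000), 2))
```

The linear projection assumes the sample is representative of the full dataset in both difficulty and token length, which is worth checking before committing to a provider.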
The market for API tokens is not just people using OpenCode and similar tools.
Nope. I get very good results from GLM 5 and 5.1. I’m not working on anything so complex and groundbreaking that I need the best.
Coding is a rung on the ladder of model capability. Frontier models will grow to take on more capabilities, while smaller, more focused models start becoming the economical choice for coding.
Everyone may want the best, but the amount of AI-addressable work outstrips the budget available for buying the best by quite a wide margin.
OpenCode allows for free inference tho.
Not really. It depends on the use case. For private stuff I'm very happy to take what was SOTA a year or 2 ago if I can have it all running in my home and don't have to share any of my data with some sleazy big tech cloud.
The price is a concern too of course. But privacy is a bigger one for me. I absolutely don't trust any of their promises not to use data for training purposes.
That's only because current models don't saturate people's needs. Once they are fast and smart enough, people will pick cheaper ones.
The OpenRouter usage stats indicate the opposite: https://openrouter.ai/rankings?view=month