Hacker News

cubefox today at 3:04 PM

> I think there is a moderately large market for models like this that aren’t quite SOTA level but can be served up much cheaper.

There isn't, pretty much everyone wants the best of the best.


Replies

PhilippGille today at 3:19 PM

The OpenRouter usage stats indicate the opposite: https://openrouter.ai/rankings?view=month

thraxil today at 4:10 PM

No. Right now I'm upset that Google has removed (or at least is in the process of removing) the Gemini 2.0 flash model. We use it for some pretty basic functionality because it's cheap and fast and honestly good enough for what we use it for in that part of our app. We're being forced to "upgrade" to models that are at least 2.5 times as expensive, are slower and, while I'm sure they're better for complex tasks, don't do measurably better than 2.0 flash for what we need. Yay. We've stuck with the GCP/Gemini ecosystem up until now, but this is kind of forcing us to consider other LLM providers.

Someone1234 today at 3:15 PM

> There isn't, pretty much everyone wants the best of the best.

For direct user interaction or coding problems, perhaps. But as API calls get cheaper, it becomes more realistic to use them for completely automated workflows against datasets, or as sub-agents called from expensive SOTA models.

For example, in Claude, using Opus as an orchestrator to call Sonnet sub-agents is a popular usage "hack." That only gets more powerful as the Sonnet-equivalent model gets cheaper. Now you can spawn entire teams of small, specialized sub-agents, each with a small context window and a limited scope.
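The delegation pattern above can be sketched in plain Python. This is an illustrative toy, not real SDK code: the model names, per-token prices, and `call_model` stub are all made-up assumptions standing in for actual LLM API calls.

```python
# Hypothetical prices in dollars per 1K tokens (illustrative numbers only)
PRICES = {"big-orchestrator": 0.015, "small-subagent": 0.003}

def call_model(model: str, prompt: str) -> tuple[str, float]:
    """Stub standing in for an LLM API call; returns (reply, cost)."""
    tokens = len(prompt.split())  # crude token count for the sketch
    cost = tokens / 1000 * PRICES[model]
    return f"[{model}] handled: {prompt}", cost

def orchestrate(task: str, subtasks: list[str]) -> tuple[list[str], float]:
    """The expensive model plans once; each cheap sub-agent gets one
    narrow subtask, so it runs with a short, focused context."""
    _, cost = call_model("big-orchestrator", f"plan: {task}")
    results = []
    for sub in subtasks:
        reply, sub_cost = call_model("small-subagent", sub)
        results.append(reply)
        cost += sub_cost
    return results, cost

results, total = orchestrate(
    "refactor module",
    ["rename functions in a.py", "add type hints to b.py"],
)
```

The point of the shape: only the planning step pays the expensive per-token rate, while the bulk of the token volume flows through the cheap model.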

joefourier today at 3:17 PM

Ever hit your daily limit on Claude Code and seen how expensive it is to pay per token?

sidrag22 today at 3:11 PM

Maybe there isn't, but as understanding grows, people will realize that having an orchestration agent delegate simple work to lesser agents is significant not only for cost savings, but also for preserving context window space.

wongarsu today at 3:54 PM

For coding I want the best. But both I and $work do lots of things besides coding where smaller models like qwen3.5-27b work great at much lower cost.

scoopdewoop today at 3:08 PM

That isn't true. In a Codex or Claude Code instance, sure... but those are not the main users of APIs. If you are using LLMs in a service for customers, costs matter.

Aurornis today at 3:09 PM

The market for API tokens is bigger than people like you and me (who also want the best) using them for code.

There are a lot of data science problems that benefit from running the dataset through an LLM, which becomes bottlenecked on per-token costs. For these you take a sample subset and run it against multiple providers and then do a cost versus accuracy tradeoff.

The market for API tokens is not just people using OpenCode and similar tools.
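The sample-and-compare workflow described above can be sketched as follows. Everything here is a made-up assumption for illustration: the provider names, prices, the tiny labelled sample, and the lookup-table "answer functions" standing in for real API calls.

```python
# A small labelled sample drawn from the full dataset (illustrative)
SAMPLE = [("2+2?", "4"), ("capital of France?", "Paris"), ("3*3?", "9")]

# Hypothetical providers: (answer_fn, dollars per call). The lambdas are
# canned lookups standing in for real LLM calls.
PROVIDERS = {
    "sota-model": (
        lambda q: {"2+2?": "4", "capital of France?": "Paris", "3*3?": "9"}[q],
        0.010,
    ),
    "cheap-model": (
        lambda q: {"2+2?": "4", "capital of France?": "Paris", "3*3?": "6"}[q],
        0.001,
    ),
}

def evaluate(min_accuracy: float):
    """Run the sample through every provider, then return
    (name, accuracy, cost_per_call) for the cheapest provider that
    clears the accuracy bar, or None if nobody does."""
    scored = []
    for name, (answer, price) in PROVIDERS.items():
        correct = sum(answer(q) == gold for q, gold in SAMPLE)
        scored.append((price, name, correct / len(SAMPLE)))
    qualifying = [(n, a, p) for p, n, a in sorted(scored) if a >= min_accuracy]
    return qualifying[0] if qualifying else None
```

With a strict bar the expensive model wins; relax the bar and the cheap model becomes the rational choice, which is exactly the cost-versus-accuracy tradeoff the comment describes.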

wolttam today at 3:51 PM

Nope. I get very good results from GLM 5 and 5.1. I’m not working on anything so complex and groundbreaking that I need the best.

Coding is a rung on the ladder of model capability. Frontier models will grow to take on more capabilities, while smaller, more focused models become the economical choice for coding.

regularfry today at 3:16 PM

Everyone may want the best, but the amount of AI-addressable work outstrips the budget available for buying the best by quite a wide margin.

noman-land today at 3:23 PM

OpenCode allows for free inference tho.

wolvoleo today at 5:08 PM

Not really. It depends on the use case. For private stuff I'm very happy to take what was SOTA a year or two ago if I can have it all running in my home and don't have to share any of my data with some sleazy big tech cloud.

The price is a concern too of course. But privacy is a bigger one for me. I absolutely don't trust any of their promises not to use data for training purposes.

esafak today at 3:50 PM

That's only because current models don't saturate people's needs. Once they're fast and smart enough, people will pick cheaper ones.