Hacker News

ACCount37 yesterday at 1:33 PM

Wrong, mostly.

Model capability is a function of model size. Scaling a model up raises its performance in every domain.

An "idiot savant" model that's overtrained for a specific domain would beat a generalist model of the same size. But scale the generalist up enough, and it'll trounce the specialist. Removing poetry data from a model's training mix doesn't give you much - it might even cost you some performance - and the "idiot savant" approach of overtraining for a domain has a hard ceiling.

So far, it seems like there's some equivalent of "g factor" in LLMs - a broad "intelligence" value that performance across many diverse domains correlates with. And, as a rule, larger models have more of it.


Replies

everforward yesterday at 4:06 PM

While I disagree with OP about removing stuff from the model, there’s a valid question about tradeoffs between intelligence and price.

Deepseek Flash is almost certainly wrong more often than Opus or Fable. It also costs like 5% as much.

The question becomes: if I run Deepseek in a loop to fix the mistakes it made that Opus/Fable didn’t, can it fix its own bugs in few enough tokens that it’s still cheaper?

So far, the answer seems to be “yes, by a significant margin”. A lot of tasks are simple enough that both Deepseek and Opus or Sonnet can one-shot them, which is a huge cost win for Deepseek. Even on the long tail, it’s usually like 4x the tokens on Deepseek, which is still way cheaper than Opus.
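
A rough back-of-the-envelope sketch of that arithmetic, in Python. The 5% price ratio and the 4x token multiplier come from the figures above; everything else is an assumption made for illustration: the function names are hypothetical, the 20% long-tail share is a made-up number, and the model assumes both models spend the same tokens on tasks they one-shot and that the pricier model never needs a retry.

```python
def relative_cost(price_ratio: float,
                  tail_fraction: float,
                  tail_token_multiplier: float) -> float:
    """Expected cost of the cheap model as a fraction of the pricey model's cost.

    price_ratio: cheap model's per-token price divided by the pricey model's.
    tail_fraction: share of tasks where the cheap model has to loop on fixes.
    tail_token_multiplier: tokens the cheap model burns on those tasks,
        relative to what the pricey model would spend on them.
    """
    # Easy tasks cost 1x tokens; long-tail tasks cost the multiplier.
    expected_tokens = (1.0 - tail_fraction) + tail_fraction * tail_token_multiplier
    return price_ratio * expected_tokens


def break_even_multiplier(price_ratio: float) -> float:
    """Token multiplier at which the cheap model stops being cheaper."""
    return 1.0 / price_ratio


PRICE_RATIO = 0.05      # "costs like 5% as much" (from the comment)
TAIL_MULTIPLIER = 4.0   # "like 4x the tokens" (from the comment)
TAIL_FRACTION = 0.2     # hypothetical: 1 in 5 tasks lands on the long tail

mixed = relative_cost(PRICE_RATIO, TAIL_FRACTION, TAIL_MULTIPLIER)
worst = relative_cost(PRICE_RATIO, 1.0, TAIL_MULTIPLIER)

print(f"mixed workload: {mixed:.0%} of the pricey model's cost")
print(f"every task on the long tail: {worst:.0%}")
print(f"break-even token multiplier: {break_even_multiplier(PRICE_RATIO):.0f}x")
```

Under these assumptions the cheap model stays ahead until it needs roughly 20x the tokens, so a 4x long tail leaves plenty of headroom. The sketch ignores wall-clock time and the tasks the cheap model never gets right at all.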

There are things that Opus can do that Deepseek just won’t ever really nail, but it happens so infrequently that I just don’t worry. Like most people, most of what I do is the same sort of “3 tier app with a React frontend” that doesn’t take a rocket scientist to work out.

overfeed yesterday at 2:27 PM

> Wrong, mostly.

> Model capability is a function of model size

Model effectiveness has improved across model sizes. You really should try the latest flash variants more. They have become my default for most tasks except for gnarly high-level planning.
