> which is effectively what this model is doing.
That isn't what it's doing and it's not what distillation is.
The smaller models are distillations; they use the same architecture they had before.
The compute required to run Llama-3.1-8B and DeepSeek-R1-Distill-Llama-8B is identical.
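For what it's worth, you can check this yourself. A minimal sketch, assuming the `transformers` library and access to both Hugging Face repos (the Meta repo is gated, so a local copy of the config works just as well): the two configs describe the same Llama architecture, so the per-token inference cost is the same.

```python
# Sketch: compare the architecture of the base Llama model and the R1 distill.
# Assumes `pip install transformers` and Hugging Face access to both repos.
from transformers import AutoConfig

base = AutoConfig.from_pretrained("meta-llama/Llama-3.1-8B")
distill = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")

# Both are Llama configs; the distill is the same network fine-tuned on R1
# outputs, so the shapes (and hence the compute) should match.
for key in ("hidden_size", "num_hidden_layers", "num_attention_heads",
            "num_key_value_heads", "intermediate_size"):
    print(key, getattr(base, key), getattr(distill, key))
```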
In general I agree that this is a rapidly advancing space, but specifically:
> the Llama 8B model trained on R1 outputs (DeepSeek-R1-Distill-Llama-8B), according to these benchmarks, is stronger than Claude 3.5 Sonnet
My point is that the phrase 'according to these benchmarks' is key here, because it's enormously unlikely (and this is borne out by the reviews of people testing these distilled models) that:
> the Llama 8B model trained on R1 outputs (DeepSeek-R1-Distill-Llama-8B) is stronger than Claude 3.5 Sonnet
So, if you have two things:
1) Benchmark scores
2) A model that, in practice, clearly isn't dramatically better after the distillation process.
Clearly, one of those two things is wrong.
Either:
1) The benchmarks are meaningless.
2) People are somehow too stupid to be able to evaluate the 8B models, and they really are as good as Claude Sonnet.
...
Which of those seems more likely?
Perhaps I'm biased, or wrong, because I don't care about the benchmark scores, but my experience playing with these distilled models is that they're good, but not as good as Sonnet; and that should come as absolutely no surprise to anyone.
Another possible conclusion is that your definition of good, whatever that may be, doesn’t include the benchmarks these models are targeting.
I don’t actually know what they all are, but MATH-500, for instance, is a math problem-solving benchmark that Sonnet is not all that good at.
The benchmarks target specific weaknesses that LLMs generally have from learning only next-token prediction and instruction tuning. In fact, benchmarks show there are large gaps in some areas, like math, where even top models don’t perform well.
‘According to these benchmarks’ is key, but not for the reasons you’re expressing.
Option 3: It’s key because that’s the gap they’re trying to fill. Realistically, most people aren’t using models to solve algebra problems in personal usage, so performance on that benchmark isn’t as visible if you aren’t using an LLM for that.
If you look at a larger suite of benchmarks, then I would expect them to underperform compared to Sonnet. It’s no different than sports stats: you can say who is best at one specific part of the game (rebounds, 3-point shots, etc.), and you have a general sense of who is best overall (e.g. LeBron, Jordan), but the best players aren’t the best at everything, and it’s hard to argue who is the ‘best of the best’ because that depends on what weight you give to the individual stats they’re good at. And then you also have a lot of players who are good at just one thing.
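To make the weighting point concrete, here is a toy illustration with made-up numbers (not the actual benchmark results): which model comes out "best" flips entirely depending on how you weight the individual benchmarks.

```python
# Toy example with invented scores -- the ranking depends on the chosen weights.
scores = {
    "distill-8b": {"math": 0.89, "coding": 0.55, "writing": 0.50},
    "sonnet":     {"math": 0.78, "coding": 0.70, "writing": 0.75},
}

def weighted(model, weights):
    # Weighted aggregate over the individual benchmark scores.
    return sum(scores[model][k] * w for k, w in weights.items())

for name, weights in [("math-heavy", {"math": 0.80, "coding": 0.10, "writing": 0.10}),
                      ("balanced",   {"math": 0.34, "coding": 0.33, "writing": 0.33})]:
    ranking = sorted(scores, key=lambda m: weighted(m, weights), reverse=True)
    print(name, "->", ranking)
# math-heavy -> ['distill-8b', 'sonnet']
# balanced   -> ['sonnet', 'distill-8b']
```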