Hacker News

deepsquirrelnet · 01/20/2025

It’s somewhere in between, really. This is a rapidly advancing space, so to some degree it’s expected that new bars get set every few months.

There’s also a lot of work going on right now showing that small models can significantly improve their outputs by running inference multiple times[1], which is effectively what this model is doing. So even small models can produce better outputs by increasing the amount of compute spent on them at inference time.
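
To make that concrete, here's a minimal sketch of the repeated-sampling / majority-voting idea from [1]. The sample_answer() stub is hypothetical, standing in for whatever small model you'd actually call with temperature > 0:

    import random
    from collections import Counter

    def sample_answer(prompt: str) -> str:
        # Hypothetical stub: a real version would call the model with
        # temperature > 0 so repeated samples can disagree.
        return random.choice(["42", "42", "41"])

    def majority_vote(prompt: str, n_samples: int = 16) -> str:
        # Spend extra inference-time compute by sampling n times,
        # then keep the most common final answer.
        answers = [sample_answer(prompt) for _ in range(n_samples)]
        return Counter(answers).most_common(1)[0][0]

    print(majority_vote("What is 6 * 7?"))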

I get the benchmark fatigue, and it’s merited to some degree. But in spite of that, models have gotten significantly better in the last year, and they continue to do so. In some sense, really good models should be difficult to evaluate, because that difficulty is itself an indicator of progress.

[1] https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling...


Replies

noodletheworld · 01/20/2025

> which is effectively what this model is doing.

That isn't what it's doing and it's not what distillation is.

The smaller models are distillations; they use the same architecture they were using before.

The compute required for Llama-3.1-8B and DeepSeek-R1-Distill-Llama-8B is identical.
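
If you want to sanity-check that yourself, something like the following should do it. This is a rough sketch, assuming you have transformers installed and access to both Hub repos (I believe they are meta-llama/Llama-3.1-8B, which is gated, and deepseek-ai/DeepSeek-R1-Distill-Llama-8B):

    from transformers import AutoConfig

    # Compare the architecture configs of the base and distilled models.
    base = AutoConfig.from_pretrained("meta-llama/Llama-3.1-8B")
    distill = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")

    for key in ("hidden_size", "num_hidden_layers", "num_attention_heads"):
        print(key, getattr(base, key), getattr(distill, key))
    # Same shapes -> same compute per token; only the weights differ.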

In general I agree that this is a rapidly advancing space, but specifically:

> the Llama 8B model trained on R1 outputs (DeepSeek-R1-Distill-Llama-8B), according to these benchmarks, is stronger than Claude 3.5 Sonnet

My point is that the phrase 'according to these benchmarks' is key here, because it's enormously unlikely (and this is borne out by the reviews of people testing these distilled models) that:

> the Llama 8B model trained on R1 outputs (DeepSeek-R1-Distill-Llama-8B) is stronger than Claude 3.5 Sonnet

So, if you have two things:

1) Benchmark scores

2) A model that clearly did not get enormously better from the distillation process.

Clearly, clearly, one of those two things is wrong.

Either:

1) The benchmarks are meaningless.

2) People are somehow too stupid to evaluate the 8B models, and they really are as good as Claude Sonnet.

...

Which of those seems more likely?

Perhaps I'm biased, or wrong, because I don't care about the benchmark scores, but my experience playing with these distilled models is that they're good, just not as good as Sonnet; and that should come as absolutely no surprise to anyone.
