Hacker News

tensor · today at 5:25 AM

No, distillation is far older than DeepSeek. DeepSeek was impressive because of algorithmic improvements that allowed them to train a model of that size with vastly less compute than anyone expected, even accounting for distillation.

I also haven’t seen any hard data on how much they actually used distillation-like techniques. They certainly used a lot of synthetically generated data to get better at reasoning, something that is now commonplace.
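For context on why distillation predates DeepSeek: the classic formulation (Hinton et al., 2015) trains a student to match a teacher's temperature-softened output distribution via a KL-divergence loss. A minimal sketch below; the function names (`softmax`, `distillation_loss`) and the toy logits are illustrative, not any lab's actual training code.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax; higher T produces softer distributions.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) on softened distributions, scaled by T^2
    # as in the classic knowledge-distillation formulation.
    p = softmax(teacher_logits, temperature)  # soft teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# A student that matches the teacher exactly incurs zero loss.
print(distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))  # 0.0
```

"Distillation-like" pipelines for reasoning differ mainly in that the teacher's outputs are sampled text (synthetic data) rather than logits, but the teacher-to-student idea is the same.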


Replies

MobiusHorizons · today at 8:00 AM

Thanks, it seems I conflated the two.