Hacker News

archon, today at 12:28 PM (2 replies)

I'm uneducated on how distillation works at more than a basic level so forgive me if this is a stupid question.

Isn't "distillation" of another provider's model exactly how these models got their training data in the first place: massive amounts of the written word, plus prompt -> answer pairs? Why wouldn't distillation produce similar "reasoning" in the new model? It's just inputs and outputs.


Replies

maxbond, today at 1:01 PM

What you're describing is (pre-)training. Distillation requires richer labels: the probability distribution over tokens (strictly it would be logits rather than probabilities, but that's not important). From a chat transcript you only see the token that was sampled, which is the argmax (most likely token) of that distribution only if the API lets you set the temperature to 0. It's not impossible for an API to expose the full distribution, but providers won't if they don't want you distilling their models.

The intuition is that distillation exploits not only the "right" answer but the relationship between answers (what's the second most right answer? the third? etc).
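To make that intuition concrete, here is a minimal sketch contrasting a hard-label loss (only the teacher's top token, which is all a temperature-0 transcript reveals) with a soft-label loss over the full distribution. The 4-token vocabulary, the logit values, the function names, and the softening temperature of 2.0 are all made up for illustration; real distillation setups differ in the details.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def hard_label_loss(student_logits, teacher_logits):
    """Cross-entropy against only the teacher's argmax token."""
    target = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    student_probs = softmax(student_logits)
    return -math.log(student_probs[target])

def soft_label_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over the whole token distribution.

    The temperature is an assumed softening knob, not a value from the thread.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical logits over a 4-token vocabulary.
teacher = [4.0, 3.5, 0.5, -2.0]    # token 1 is a close second choice
student_a = [4.0, 0.0, 0.0, 0.0]   # right top token, ignores the runner-up
student_b = [4.0, 3.5, 0.5, -2.0]  # reproduces the teacher's ranking exactly

for name, student in [("student_a", student_a), ("student_b", student_b)]:
    print(name,
          "hard:", round(hard_label_loss(student, teacher), 4),
          "soft:", round(soft_label_loss(student, teacher), 4))
```

In this toy setup the hard-label loss actually scores student_a better, because it rewards piling probability onto the top token, while the soft-label loss is zero only for student_b, which also matches the second and third choices.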

zozbot234, today at 12:31 PM

Among other things, because you simply can't get those "massive amounts" of text from a SOTA model at a reasonable cost. And complex reasoning cannot possibly be trained in a pure one-shot fashion; real post-training takes massive resources. The whole story doesn't add up.