i’ve often thought that less than one second is all you need. One of my fun superpowers, when someone asks what i’d like to have, is being 1 second ahead of everyone else — that’s all i need. i honestly don’t know where the distillation conversation is at. is it real, is it ongoing? i think that aspect would be a big one. Your point is valid if it’s valid. i’m not a great global citizen, you know, lots going on out and about.
A lot of distillation happens. E.g. the OLMo models have a completely open dataset and they are heavily distilled. It only makes sense to try to absorb behaviors from the best models out there. That said, I think the open-weight juggernauts are doing genuinely great work with RL, training environments, architectural innovations, etc.