I wish people would stop repeating Anthropic’s incorrect use of the term “distill.” They don’t share logits, so you can’t distill. You can generate training data, which doesn’t sound nearly so scary.
Why do you need logits to distill? Those are at least tokenizer-dependent, and different models use different tokenizers.
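For anyone following along, here is a rough Python sketch of the two things being contrasted. It is only an illustration under stated assumptions: the function names `soft_label_loss` and `hard_label_loss` are made up for this example, the 4-token vocabulary and all numbers are toy values, and it is not a description of anyone's actual training pipeline. The soft-label version needs the teacher's full logit vector over a vocabulary shared with the student, which is where the tokenizer dependence comes in. The hard-label version only needs the token the teacher actually emitted, which is all you get from generated text.

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax with optional temperature scaling.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) for one token position.
    # Requires the teacher's logits AND a shared vocabulary (same tokenizer).
    if len(teacher_logits) != len(student_logits):
        raise ValueError("logit matching assumes a shared vocabulary")
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def hard_label_loss(sampled_token_id, student_logits):
    # Ordinary cross-entropy against the token the teacher emitted.
    # Requires only the teacher's output text, re-tokenized for the student.
    q = softmax(student_logits)
    return -math.log(q[sampled_token_id])

# Toy values, made up for illustration only.
teacher_logits = [2.0, 1.0, 0.1, -1.0]
student_logits = [0.5, 0.4, 0.3, 0.2]

kd = soft_label_loss(teacher_logits, student_logits)  # needs teacher logits
ce = hard_label_loss(0, student_logits)               # needs only a sampled token
```

The sketch leaves out details a real setup would have (batching, sequence positions, any loss weighting), so treat it as a picture of what information each approach needs rather than a recipe.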