
anima-core, today at 9:51 AM

The memory story is actually much simpler than it looks.

The teacher still has to be loaded at training time, so the training-time footprint is whatever the original model requires. Again, the compression doesn't shrink the teacher; it produces a small student head alongside it. After training, the teacher is no longer needed and the student runs by itself. That's why the inference footprint drops to a few MB.
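Roughly, the training side looks like this. This is a minimal PyTorch sketch of the general "extract layer-1 features, train a small head" flow, not the repo's actual script; the StudentHead class, the stand-in 7B teacher checkpoint, and the toy data are all placeholders.

    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    TEACHER = "meta-llama/Llama-2-7b-hf"   # stand-in; the same flow applies to a 70B teacher

    class StudentHead(nn.Module):
        """Small MLP mapping layer-1 hidden states to task labels."""
        def __init__(self, d_in: int, d_hidden: int, n_out: int):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(d_in, d_hidden), nn.GELU(), nn.Linear(d_hidden, n_out)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.net(x.mean(dim=1))   # mean-pool over tokens, then classify

    tok = AutoTokenizer.from_pretrained(TEACHER)
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token
    teacher = AutoModel.from_pretrained(TEACHER, output_hidden_states=True).eval()

    texts = ["the movie was great", "the movie was terrible"]   # toy task data
    labels = torch.tensor([1, 0])

    head = StudentHead(teacher.config.hidden_size, 256, n_out=2)
    opt = torch.optim.AdamW(head.parameters(), lr=1e-3)

    for _ in range(10):
        with torch.no_grad():                      # teacher is frozen, used only for features
            batch = tok(texts, return_tensors="pt", padding=True)
            hidden = teacher(**batch).hidden_states
            field = hidden[1]                      # index 0 = embeddings, index 1 = first block's output
        loss = nn.functional.cross_entropy(head(field), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()

    torch.save(head.state_dict(), "student_head.pt")   # a few MB; the teacher can be dropped after this

The teacher only ever runs under no_grad to produce features, which is why its memory cost exists at training time and nowhere else.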

It doesn't increase inference time at all. It removes transformers entirely from the inference path. The student computes directly on the layer-1 field, which is why it's so small and so fast.
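The inference side is correspondingly small. Again a sketch, not the repo's code: get_layer1_field stands in for however the extraction pipeline produces the field for a new input, and StudentHead is the same illustrative class as in the training sketch above.

    import torch

    # Only the saved head is loaded here; no transformer weights are touched.
    head = StudentHead(d_in=4096, d_hidden=256, n_out=2)   # d_in must match the teacher's hidden size
    head.load_state_dict(torch.load("student_head.pt"))
    head.eval()

    with torch.no_grad():
        field = get_layer1_field("some new input")   # hypothetical helper from the extraction pipeline
        prediction = head(field).argmax(dim=-1)      # no transformer blocks execute on this path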

On the request for a distilled “few MB” head for Llama 70B, that part is already reproducible right from the repo. The head is always task-specific, not a general LLM, so uploading a single checkpoint wouldn't tell the whole story. The better path is to run the extraction script and train the head for whatever task you want. The pipeline is fully open, end to end, and I'm looking for people to validate it independently.
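To make "task specific" concrete (same illustrative StudentHead as above): the head's output layer is tied to a particular task's label space, so a sentiment head and a topic head are separate few-MB checkpoints, and neither one is a general-purpose Llama replacement.

    # Each task trains and ships its own head; only the label space and training data differ.
    sentiment_head = StudentHead(d_in=4096, d_hidden=256, n_out=2)    # e.g. positive / negative
    topic_head     = StudentHead(d_in=4096, d_hidden=256, n_out=20)   # e.g. 20 topic classes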

If you need anything else cleared up, just let me know.