Technical feedback:
Every single announcement like this, compression claims included, needs to state the lower limit of machine requirements. If a 64 GB model is compressed 224x, shouldn't it be runnable on a video card with ~292 MB of memory?
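A quick sanity check of that arithmetic (my own back-of-the-envelope, assuming the compression factor applies uniformly to the weights):

```python
# If the claimed 224x compression applied uniformly to a 64 GB model,
# the weights alone would shrink to roughly a 292 MB footprint.
model_size_gb = 64
compression_factor = 224

compressed_mb = model_size_gb * 1024 / compression_factor
print(f"{compressed_mb:.1f} MB")  # ~292.6 MB
```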
That's exactly what I was trying to infer from the abstract, which sadly doesn't explicitly call out memory requirements. I assume it improves inference time by getting rid of the transformer. So what are the memory requirements?
Edit: they claim these somewhere in the doc:
> Memory
> Teacher model: multi-GB (entire model must be loaded)
> AN1 head: a few MB (only head needed after training)
I find the claims surreal; can't wait for someone to validate this, or I will do it myself. It would have been handy to upload such a "few MB" weight file distilled off Llama 70B so that we can see for ourselves whether the 220x inference speedup and the in-memory model size compression are real.
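For what it's worth, here's the first check I'd run if they did publish it. This is a minimal sketch, assuming the head ships as a flat PyTorch checkpoint; the file name is hypothetical, and nested state dicts would need unpacking first:

```python
import torch

# Hypothetical checkpoint name; whatever they actually release.
state = torch.load("an1_head.pt", map_location="cpu")

# Sum raw parameter bytes: the in-memory weight footprint,
# ignoring activations and runtime overhead.
total_bytes = sum(t.numel() * t.element_size() for t in state.values())
print(f"weights: {total_bytes / 2**20:.1f} MiB")  # "a few MB" claim or not
```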