Yeah, burying this on page 8 is a bit suspect imo (the eval datasets are listed on page 3, so if you were familiar with them you would have a hint then).
The distillation of a student that predicts "anchor layers" and then acts as a backbone for classification is perfectly cool on its own; no need to stretch the title/abstract so much.
agreed re: title/abstract stretching. good work stands on its own without needing hype. "we found a nifty way to distill llama-70b using a much smaller student transformer model; the key is using intermediate activation layers in a compressed representation" would be about as effective at selling it while being more immediately approachable, IMO.