reitzensteinm • today at 12:46 AM • 0 replies

I think this is doable for very long-tail experts that get swapped in for specialised topics, say, orbital mechanics.

But for experts that light up at, say, 1% frequency per batch, you're doing an awful lot of transfers from DRAM, each of which you amortize over a single token, instead of reads from HBM, which you amortize over 32 tokens.
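To put rough numbers on that, here's a back-of-the-envelope sketch. The expert size and both bandwidth figures are made-up placeholders, not measurements of any real system; only the 1-token vs 32-token amortization comes from the point above.

```python
def weight_traffic_time_per_token(expert_bytes, bandwidth_bytes_per_s, tokens_sharing):
    """Seconds spent moving one expert's weights, split across the tokens that use it."""
    return expert_bytes / bandwidth_bytes_per_s / tokens_sharing


# Hypothetical figures, chosen only for illustration.
EXPERT_BYTES = 100e6   # assumed size of one expert's weights
DRAM_LINK_BW = 50e9    # assumed host DRAM -> accelerator bandwidth, bytes/s
HBM_BW = 2e12          # assumed on-package HBM bandwidth, bytes/s

# Rare expert: fetched from DRAM, and only one token in the batch needs it.
cold = weight_traffic_time_per_token(EXPERT_BYTES, DRAM_LINK_BW, tokens_sharing=1)

# Resident expert: read from HBM once and reused by 32 tokens.
hot = weight_traffic_time_per_token(EXPERT_BYTES, HBM_BW, tokens_sharing=32)

ratio = cold / hot
print(f"cold: {cold:.2e} s/token, hot: {hot:.2e} s/token, ratio: {ratio:.0f}x")
```

With these placeholder numbers the gap is about 1280x: 40x from the bandwidth difference, times 32x from losing the amortization. Even with identical bandwidth you'd still be paying the 32x.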
