logoalt Hacker News

reitzensteinmtoday at 12:46 AM0 repliesview on HN

I think this is doable for very long tail experts that get swapped in for specialised topics - say, orbital mechanics.

But for experts that light up at, say, 1% frequency per batch, you're doing an awful lot of transfers from DRAM which you amortize over a single token, instead of reads from HBM which you amortize over 32 tokens.