I think this is doable for very long-tail experts that only get swapped in for specialised topics - say, orbital mechanics.
But for experts that light up at, say, 1% frequency per batch, you're doing an awful lot of DRAM transfers that each get amortized over a single token, instead of HBM reads that get amortized over all 32 tokens in the batch.
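A rough back-of-the-envelope sketch of that amortization gap, with purely illustrative numbers (the 2 GiB expert size and batch size of 32 are assumptions, not measurements):

    # Weight bytes moved per token: expert resident in HBM vs swapped in from DRAM.
    expert_bytes = 2 * 1024**3   # assumed ~2 GiB of weights for one expert
    batch_size = 32              # tokens per batch, as in the comment above

    # HBM-resident: one read of the weights is shared across the whole batch.
    hbm_bytes_per_token = expert_bytes / batch_size

    # DRAM-offloaded: the transfer serves only the single token that routed to it.
    dram_bytes_per_token = expert_bytes / 1

    print(f"HBM:  {hbm_bytes_per_token / 2**20:.0f} MiB moved per token")
    print(f"DRAM: {dram_bytes_per_token / 2**20:.0f} MiB moved per token")
    print(f"ratio: {dram_bytes_per_token / hbm_bytes_per_token:.0f}x")

And that's before accounting for DRAM/PCIe bandwidth being far lower than HBM bandwidth, which only widens the gap.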