And even if you have enough VRAM to fit the entire thing, inference speed after the first token is roughly proportional to (VRAM bandwidth)/(activated parameters), since each generated token has to stream every active weight out of memory once.
If you have the VRAM to spare, a model with more total params but fewer activated ones can be a very worthwhile tradeoff. Of course, that's a big if.
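To make that concrete, here's a rough back-of-envelope sketch in Python. All the numbers (model sizes, quantization width, bandwidth) are illustrative assumptions, not measured figures, and it ignores KV-cache reads and other overheads:

```python
# Decode is memory-bandwidth bound: tokens/sec is capped by how fast you
# can stream the active weights out of VRAM for each generated token.

def decode_tokens_per_sec(active_params_b: float, bytes_per_param: float,
                          bandwidth_gb_s: float) -> float:
    """Upper-bound tokens/sec if every token reads all active weights once."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Hypothetical dense 70B at 4-bit (~0.5 bytes/param) on a ~1 TB/s card:
print(decode_tokens_per_sec(70, 0.5, 1000))   # ~28.6 tok/s

# Hypothetical MoE with far more total params but only ~17B active per
# token, same card and quantization:
print(decode_tokens_per_sec(17, 0.5, 1000))   # ~117.6 tok/s
```

Same bandwidth, ~4x the decode speed, because only the active experts get read per token. The catch is exactly the one above: the full parameter set still has to fit in VRAM.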