It would apply equally to GPU or RAM inference as those are also bandwidth constrained on decode, so...

searealist • yesterday at 7:56 PM • 0 replies • view on HN

It would apply equally to GPU or RAM inference as those are also bandwidth constrained on decode, so people already try to optimize for it.

alt Hacker News