logoalt Hacker News

searealistyesterday at 7:56 PM0 repliesview on HN

It would apply equally to GPU or RAM inference as those are also bandwidth constrained on decode, so people already try to optimize for it.