logoalt Hacker News

boroboro406/16/20250 repliesview on HN

> Also note, if the sequence length is not really much larger than the model dimension (at least two orders of magnitude more), the quadratic complexity of the self-attention is really not such a big issue - the matrix multiplication in the feed-forward layers will be usually 8x the model dimension squared, and thus that part will usually dominate.

This is incorrect in case of batched inference. There are two bottlenecks at play: compute and memory, and your reasoning applies to compute. In case of memory it gets trickier: for MLP layers you’ll need to read same set of weights for all elements of your batch, while for kv cache for attention elements will be different. That’s why in practice the real length where attention dominates would be closer to model dimension / batch size, rather than just model dimension. And this number isn’t as high anymore.