Hacker News

Breaking Quadratic Barriers: A Non-Attention LLM for Ultra-Long Context Horizons

39 points by PaulHoule, 06/16/2025, 17 comments

Comments

albertzeyer 06/16/2025

"hundreds of thousands to potentially millions of tokens" - that's the same order as current commercial LLMs.

Also note that unless the sequence length is much larger than the model dimension (by at least two orders of magnitude), the quadratic complexity of self-attention is really not such a big issue: the matrix multiplications in the feed-forward layers usually cost 8x the model dimension squared per token, so that part usually dominates.
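To make that comparison concrete, here is a rough back-of-the-envelope sketch of per-token multiply-accumulate counts for one standard Transformer layer. It assumes a feed-forward expansion factor of 4 (which is where the 8x figure comes from), full non-causal attention, and it ignores softmax, normalization, and memory-bandwidth effects. The function name `per_token_macs` and its parameters are hypothetical, chosen only for illustration.

```python
def per_token_macs(seq_len: int, d_model: int, ffn_mult: int = 4) -> dict:
    """Rough multiply-accumulate counts per token for one Transformer layer.

    Assumptions (not taken from the paper): FFN hidden size is
    ffn_mult * d_model, attention is full (every token attends to all
    seq_len positions), and only matrix multiplications are counted.
    """
    # Feed-forward: d_model -> ffn_mult*d_model -> d_model, two matrices.
    ffn = 2 * ffn_mult * d_model * d_model  # 8 * d_model^2 when ffn_mult == 4
    # Q, K, V, and output projections: four d_model x d_model matrices.
    attn_proj = 4 * d_model * d_model
    # Sequence-dependent part: QK^T scores plus the weighted sum over V,
    # each costing seq_len * d_model per token (summed over all heads).
    attn_mix = 2 * seq_len * d_model
    return {"ffn": ffn, "attn_proj": attn_proj, "attn_mix": attn_mix}


if __name__ == "__main__":
    d_model = 4096  # illustrative value only
    for seq_len in (4_096, 16_384, 409_600):
        macs = per_token_macs(seq_len, d_model)
        ratio = macs["attn_mix"] / macs["ffn"]
        print(f"seq_len={seq_len:>7}  attn_mix/ffn = {ratio:.2f}")
```

Under these assumptions the ratio of the sequence-dependent attention cost to the feed-forward cost is seq_len / (4 * d_model), so the two are equal when the sequence is four times the model dimension, and the attention term only becomes overwhelming well beyond that.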

Also note that there has been so much research on this already. While this particular approach might be novel, there have been attempts to avoid the O(n^2) complexity of self-attention almost since the original transformer paper came out in 2017. I am a bit surprised that this paper does not cite xLSTM or Block-Recurrent Transformers.

Also, this paper falls very short on experiments. There is basically only Table 2. There is no study on length extrapolation (which is very relevant for the topic), no needle-in-a-haystack experiments, no scaling studies, no larger-scale experiments, etc. Even in this main Table 2, I see a couple of typos. And looking at the results in Table 2, the improvements seem to be quite minor.

So I would conclude that this needs a lot more work.

maxrmk 06/16/2025

> While the specific internal workings of DeepSeek LLM are still being elucidated, it appears to maintain or approximate the self-attention paradigm to some extent.

Totally nonsensical. DeepSeek's architecture is well documented, and multiple implementations are available online.

yorwba 06/16/2025

This paper seems rather unfocused, explaining their architecture three times with slight variations while managing to omit crucial details like how exactly they compute gradients for their "External Retrieval Memory."

Also, the section on DeepSeek is really weird: "While the precise architectural details of DeepSeek LLM are still emerging, early discussions suggest that it relies on an extended Transformer backbone or a "hybrid" approach that likely incorporates some form of attention-based mechanism, potentially at specific layers or across chunk boundaries, to facilitate information flow across large contexts." It makes it sound like a mystery, even though there have been multiple papers published on it (they cite the R1 one) so that there's really no need to guess whether attention is involved.

Overall I'm not convinced the authors know what they're doing.

daxfohl 06/16/2025

Partially related: is charging by token sustainable for LLM shops? If the compute requirements go up quadratically, doesn't that mean the cost should as well?

imranq 06/16/2025

I like the idea of removing quadratic scaling for attention, but this paper has thin experimental support. No real tasks are tested beyond perplexity. There is nothing on reasoning, retrieval QA, or summarization quality. Even in perplexity the gains are marginal.

However, it removes attention, so I think it's worth watching the space of non-attention models.

zoklet-enjoyer 06/16/2025

I don't know what those words mean, but I am excited for the possibilities.
