This one's interesting. I think the next frontier for LLMs should really just be: how can we get something like Opus 4.6 to cost drastically less for the same output? I say 4.6 because from 4.6 onwards it's been pretty darn good, at least for me. It always feels like someone hates every model upgrade; heck, even 4.5 was fine.
> SSA replaces the O(n²) dense attention pass with a learned sparse formulation that scales linearly with context length.
> At 1M tokens, SubQ 1.1 Small requires 64.5x less compute than dense attention and runs 56x faster than FlashAttention-2.
Awesome stuff. Solving context at the model architecture layer rather than trying to bolt on extra memory is the right direction IMO.
According to Subquadratic, Needle in a Haystack is strong up to 12M tokens, but RULER has not been tested above 128k tokens?
It's been all talk and no action ever since their first announcement.
> SubQ 1.1 Small scores near-perfect at 1M, 2M, 6M, and 12M tokens. The model was trained predominantly at 1M tokens yet the retrieval held near perfectly at 12x that length, despite compressing attention to just 0.13% of relationships. This generalization is a direct consequence of SSA routing attention based on content relevance rather than fixed positional patterns.
If the results persist from 1M to 12M, why not 24M or 48M? Sounds almost too good to be true.
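For a sense of scale, here's a rough sketch of what keeping 0.13% of relationships would mean per query token. It assumes the fraction is measured against full n x n attention and applies at the 1M training length; the quoted passage doesn't specify either, and the function name is made up for illustration.

```python
def keys_per_query(n_tokens: int, kept_fraction: float) -> float:
    """Average number of keys each query attends to.

    Assumption (not stated in the report): kept_fraction is measured
    against full n x n attention, so kept pairs = kept_fraction * n^2
    and the per-query average is kept_fraction * n.
    """
    return n_tokens * kept_fraction


KEPT = 0.0013  # 0.13%, taken from the quoted passage

at_1m = keys_per_query(1_000_000, KEPT)
# Hypothetical case: the same fraction still applies at 12M tokens.
at_12m_fixed_fraction = keys_per_query(12_000_000, KEPT)

print(f"1M tokens:  ~{at_1m:,.0f} keys per query")
print(f"12M tokens: ~{at_12m_fixed_fraction:,.0f} keys per query if the fraction is fixed")
```

If total cost really is linear in context length, the per-query budget would have to stay roughly constant instead, which means the kept fraction shrinks as the context grows. The report doesn't say which of the two it is.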
By back-of-the-napkin math in my head, that'd be something like 0.5 to 1 million LOC, depending on language and code density. You could just fold the entire codebase into one prompt if it's a small one. That'd be neat :)
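Spelling that napkin math out as a minimal sketch: the tokens-per-line figures below are my own assumptions, picked only to bracket the estimate above. Real ratios depend on the tokenizer, the language, and how dense the code is.

```python
def lines_of_code(context_tokens: int, tokens_per_line: int) -> int:
    """Rough count of source lines that fit in a context window."""
    return context_tokens // tokens_per_line


CONTEXT_TOKENS = 12_000_000  # the 12M-token figure discussed in the thread

# Assumed densities (not measured): terse code vs. verbose code.
for tokens_per_line in (12, 24):
    loc = lines_of_code(CONTEXT_TOKENS, tokens_per_line)
    print(f"{tokens_per_line} tokens/line -> ~{loc:,} LOC")
```

At those assumed densities, 12M tokens works out to somewhere between 500,000 and 1,000,000 lines.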
Comparing compute cost against FlashAttention-2 doesn't seem very honest to me.
FlashAttention-2 hasn't been in use for at least 2 years.
This architecture would have been a massive improvement 3 years ago, but it is a ~solved~ problem IMO.
There was, let's say, significant skepticism the last time they announced something. What's changed?
What kind of hardware would be needed to serve an instance with the full 12M context? And what kind of speeds can one expect at those extremes of 10M+ tokens?
Disappointing they don't actually say how their sparse attention mechanism works.
They've had third parties run multiple "evaluations", but it still seems that they aren't being fully transparent. I think the approach is quite interesting and novel, but this feels like déjà vu.
I get why they aren't disclosing all the details, but for now it seems more hype-train-esque to me. I don't disagree that this could be big.
I don’t understand why this lab is allergic to providing details on what they actually made, especially when Chinese labs are more than willing to share architectural specs/code/kernels (e.g. NSA/FSA, RAMBa, HISA, DSA LightningIndexer, etc.). I don’t doubt that they’ve done something here, but the lack of details makes me default to not trusting this, particularly when this is the second time that they’ve released a “technical report” that just waxes poetic about the concept.