> Drops compute required for training by ~20%. This is not true. Authors claim that w.r.t. trai...

dvt • yesterday at 7:45 PM • 1 reply • view on HN

> Drops compute required for training by ~20%.

This is not true. Authors claim that w.r.t. training, their method adds negigible overhead for AttnRes with no memory impact (but is way more complicated for Block AttnRes since we need to use pipelining for larger models, hence the O(Ld) & O(Nd) figures, with N ≪ L).

> WAY lower bandwidth requirements for inference.

Also not true. Paper has nothing to do with inference, apart from the benchmarks. If you're looking at the graph about "compute advantage," it's about training compute. They do some interpolation to get to the 1.25x number, basically answering the question "if non-AttnRes architecture were trained, how much compute would it take to get to the same loss as AttnRes?" (The answer being ~20% more compute.) It's an interesting claim, but there's all kinds of weird and unexpected convergence that can happen, so take it with a grain of salt.

Replies

observationist • yesterday at 9:15 PM

I think what they're getting at is that for a given unit of compute, this method achieves 125% performance.

If model A reaches performance level 100 using 100 units of compute using old methods, and you train model B using AttnRes, aiming at performance level 100, it costs you 80 units of compute.

It probably doesn't map precisely, but that's where people are diverging from the claim - it doesn't explicitly say anything about reduced inference or training time, but that's the implicit value of these sorts of things. Less compute to equivalent performance can be a huge win for platforms at scale as well as for local models.

➕ show 1 reply

alt Hacker News

Replies