
jryio • yesterday at 8:15 PM

This is the key piece:

> Full AttnRes is straightforward but requires O(Ld) memory at scale. Block AttnRes partitions layers into N blocks, accumulates within each block via standard residuals, and applies attention only over block-level representations. With ~8 blocks, it recovers most of Full AttnRes's gains while serving as a practical drop-in replacement with marginal overhead.
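
To make the quoted description concrete, here is a rough sketch of how I read it, not the paper's actual implementation. Everything named here is an assumption for illustration: the `block_attn_res` and `attend` helpers, the one-query-vector-per-block setup, the dot-product scoring, and the choice to seed the stored states with the input. The only parts taken from the quote are the three steps: partition layers into N blocks, use standard residuals inside each block, and attend only over block-level representations.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, states):
    # Softmax-weighted mix of the stored states, scored by scaled dot product
    # with a query vector (the scoring rule is an assumption, not from the quote).
    d = len(query)
    scores = [sum(q * s for q, s in zip(query, st)) / math.sqrt(d) for st in states]
    weights = softmax(scores)
    return [sum(w * st[i] for w, st in zip(weights, states)) for i in range(d)]

def block_attn_res(x, layers, queries, num_blocks):
    # layers: list of callables mapping a vector to a vector.
    # queries: one hypothetical query vector per block.
    per_block = math.ceil(len(layers) / num_blocks)
    block_states = [x]  # block-level representations, seeded with the input
    h = x
    for start in range(0, len(layers), per_block):
        b = start // per_block
        # Block input: attention over block-level representations only.
        h = attend(queries[b], block_states)
        for f in layers[start:start + per_block]:
            # Standard residual accumulation inside the block.
            h = [hi + fi for hi, fi in zip(h, f(h))]
        block_states.append(h)
    return h, block_states
```

The point of the sketch is the size of `block_states`: it holds N + 1 vectors rather than one per layer, which is where the saving relative to the O(Ld) of the full variant would come from under these assumptions.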
