Hacker News

tuned · today at 6:43 AM · 0 replies · view on HN

It made sense to me because it is a very simple idea, I guess: causal self-attention computes the QKV interactions on the full vectors for Q, K, and V; the topological transformer can provide the same computation using full Q but scalar K and V. Instead of [N², N², N²], [N², N, N²] is used. If generation is confirmed to be on par in terms of quality, the gains are evident.
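For illustration, here is a minimal NumPy sketch of the contrast. The `scalar_kv_attention` variant is only my reading of the comment (one scalar per token for K and V, with the score formed from a scalar reduction of Q), not the actual topological-transformer formulation:

```python
import numpy as np

def standard_attention(Q, K, V):
    # Q, K, V: (N, d) — full d-dimensional vectors per token
    scores = Q @ K.T / np.sqrt(Q.shape[1])            # (N, N) pairwise scores
    # causal mask: token i may only attend to tokens j <= i
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores[mask] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                 # (N, d)

def scalar_kv_attention(Q, k, v):
    # Hypothetical variant: k and v are one scalar per token, shape (N,).
    # score_ij = mean(q_i) * k_j — a guess at the scalar interaction,
    # not the formulation from the paper under discussion.
    q_scalar = Q.mean(axis=1)                          # (N,)
    scores = np.outer(q_scalar, k)                     # (N, N)
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores[mask] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                                 # (N,) scalar output per token
```

Both variants still form an N×N score matrix, but the K and V operands shrink from N·d values each to N scalars, which is where the [N², N², N²] → [N², N, N²] saving in the comment comes from.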