This argument works better for state space models. A transformer would still process the context one token at a time, not maintain an internal 1e18 state.
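To make the distinction concrete, here's a minimal numpy sketch (toy shapes and random weights, purely illustrative, not either architecture's real implementation): the SSM carries one fixed-size state forward, while a transformer-style step attends over the whole accumulated context instead of summarizing it into a single state.

```python
import numpy as np

d, ctx_len = 16, 1000  # hypothetical sizes for illustration
rng = np.random.default_rng(0)
A = rng.normal(size=(d, d)) * 0.01
B = rng.normal(size=(d, d)) * 0.01
tokens = rng.normal(size=(ctx_len, d))

# State space model: a constant-size state h, updated once per token.
# Memory for the "memory" stays O(d) no matter how long the context gets.
h = np.zeros(d)
for x in tokens:
    h = A @ h + B @ x

# Transformer-style step: the newest token attends over all previous tokens,
# so per-step compute grows with context length, but nothing forces all
# prior information through one fixed-size internal state.
def attend(query, keys, values):
    scores = keys @ query / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

out = attend(tokens[-1], tokens[:-1], tokens[:-1])
```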
That doesn't matter. Are you familiar with any theoretical results in which the computation is somehow limited in ways that practically matter when the context length is very long? I am not.