Hacker News

yorwba today at 7:04 AM

This architecture does not allow later layers to directly query KV data from earlier layers. Each iteration of the loop uses the same layer parameters, so the KV data in later layers may well end up being the same, but only if the model stops changing it in response to other tokens in the context. A traditional multi-layer transformer could do that too (but might not end up doing so, due to the lack of a corresponding inductive bias).
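To make the point about shared parameters concrete, here is a minimal sketch in plain Python. It is not the architecture under discussion: it assumes a single causal attention head with a residual connection, and the names `attention_step` and `looped_forward` are made up for illustration. Each pass recomputes K/V from the current hidden states and attends only to those, so earlier passes' K/V are never queried directly.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attention_step(hidden, Wq, Wk, Wv):
    """One causal self-attention pass with a residual connection.

    Returns the updated hidden states and the K/V computed in this pass.
    """
    Q = [matvec(Wq, h) for h in hidden]
    K = [matvec(Wk, h) for h in hidden]
    V = [matvec(Wv, h) for h in hidden]
    out = []
    for t, q in enumerate(Q):
        # Token t attends to positions 0..t, using only this pass's K/V.
        scores = [sum(a * b for a, b in zip(q, K[s])) / math.sqrt(len(q))
                  for s in range(t + 1)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        mixed = [sum(w / total * V[s][d] for s, w in enumerate(weights))
                 for d in range(len(q))]
        out.append([h + m for h, m in zip(hidden[t], mixed)])
    return out, (K, V)

def looped_forward(hidden, Wq, Wk, Wv, n_loops):
    """Apply the same parameters n_loops times (weight-tied loop)."""
    kv_per_loop = []
    for _ in range(n_loops):
        hidden, kv = attention_step(hidden, Wq, Wk, Wv)
        kv_per_loop.append(kv)  # recorded for inspection only, never re-read
    return hidden, kv_per_loop

identity = [[1, 0], [0, 1]]
zeros = [[0, 0], [0, 0]]
tokens = [[1, 0], [0, 1]]

# The hidden states change between passes, so the K/V differ per pass.
_, kv_changing = looped_forward(tokens, identity, identity, identity, 2)
# The update is zero, the hidden states stay fixed, so the K/V repeat.
_, kv_fixed = looped_forward(tokens, identity, identity, zeros, 2)
```

In this toy setup the K/V only coincide across passes when the hidden states stop changing, which is the condition described above.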

None of this helps with the strawberry problem, where the very first layer already gets a tokenized representation, so there is no layer that "actually perceives those Rs."
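A toy illustration of why no layer sees the letters, assuming a made-up three-entry vocabulary and a greedy longest-match tokenizer (`TOY_VOCAB` and `toy_tokenize` are hypothetical; real tokenizers use different vocabularies and may split the word differently):

```python
# Hypothetical vocabulary; the splits and IDs are invented for illustration.
TOY_VOCAB = {"str": 101, "aw": 102, "berry": 103}

def toy_tokenize(text, vocab):
    """Greedy longest-match tokenizer over a fixed vocabulary."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry matches at position {i}")
    return ids

word = "strawberry"
token_ids = toy_tokenize(word, TOY_VOCAB)  # what the first layer receives
letter_count = word.count("r")             # visible only at the character level
```

The model's input is the list of integer IDs; the count of "r" is a property of the character string, and the IDs themselves carry no direct trace of it.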


Replies

cainxinth today at 1:36 PM

Is it fair to say that the “Rs in strawberry problem” will not be “cleanly” solved unless we advance beyond tokenization?
