LLMs can look back over a certain number (N) of tokens, which roughly correspond to words. For instance if you want to summarize or answer questions about a document accurately the length of the document has to be less than N.
Conventionally they use an attention mechanism that compares every token to every other token which has a cost of N*N or N squared which is quadratic. If you want LLMs to chew over a huge amount of context (all the source code for your project) it’s a problem so people are looking for ways around this.
Not even that. With KV-caching, it's linear with the size of the context; and if someone figured out a way to have e.g. NlogN complexity, I imagine with KV-caching it may go down to logN complexity. (If the new algorithm permits that.)
Thank you for that explanation