One of the big problems with attention mechanisms is that the query has to be scored against every single cached key, which gets very expensive for long contexts.
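For concreteness, here's a minimal PyTorch sketch of one decoding step (toy shapes, not any particular model): the query is dotted against all `n_keys` cached keys, so per-token cost and KV-cache memory both grow linearly with context length.

```python
import torch

d, n_keys = 64, 8192            # head dim, cached context length
q = torch.randn(d)              # query for the current token
K = torch.randn(n_keys, d)      # every cached key so far
V = torch.randn(n_keys, d)      # corresponding values

# One decoding step: the query is scored against ALL n_keys keys,
# so each new token costs O(n_keys * d) on top of an ever-growing cache.
scores = K @ q / d ** 0.5                   # (n_keys,)
out = torch.softmax(scores, dim=0) @ V      # (d,)
```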
A little side project I've been working on is training a small model that sits on top of the LLM, looks at each key, and predicts a lifespan after which that key is no longer needed; once the lifespan expires, the key gets evicted from the cache. Still working on it, but my first-pass test evicted about 90% of the keys!
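Not my actual code, but roughly how I picture the eviction side working; `LifespanHead`, `evict`, and `min_keep` are all made-up names for this sketch, and real use would score keys per layer/head rather than one flat cache:

```python
import torch
import torch.nn as nn

class LifespanHead(nn.Module):
    """Small head that predicts how many more decoding steps a key stays useful."""
    def __init__(self, d: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, keys: torch.Tensor) -> torch.Tensor:
        # (n_keys, d) -> (n_keys,) predicted remaining lifespan in steps
        return self.mlp(keys).squeeze(-1)

def evict(K, V, ages, head, min_keep: int = 16):
    """Drop cache entries whose predicted lifespan has expired."""
    keep = head(K) > ages                       # still predicted useful?
    if keep.sum() < min_keep:                   # safety floor: never empty the cache
        keep[ages.argsort()[:min_keep]] = True  # fall back to keeping the youngest keys
    return K[keep], V[keep], ages[keep]
```

The interesting (hard) part is the training signal, e.g. supervising the head against how long each key actually kept receiving meaningful attention weight.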
Isn't this similar to DeepSeek's lightning indexer?