There's a chance this memory problem is not going to be that easy to solve. It's true context lengths have gotten much longer, but not all context is created equal.
There's a significant loss of model sharpness as context goes past 100K tokens. Sometimes earlier, sometimes later. Even using context windows to their maximum extent today, the models aren't always especially nuanced over long contexts. I compact after 100K tokens.
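Roughly what I mean by compacting, as a sketch. The token counting and the summarize step here are stand-ins for whatever tokenizer and model call you'd actually use:

```python
# Sketch of compaction: once the running token count passes a threshold,
# fold the older turns into a summary and keep only the recent tail verbatim.
# `count_tokens` and `summarize` are placeholders, not a real library API.

COMPACT_THRESHOLD = 100_000  # tokens
KEEP_RECENT = 20             # most recent turns kept verbatim


def count_tokens(messages):
    # crude approximation: ~4 characters per token
    return sum(len(m["content"]) for m in messages) // 4


def summarize(messages):
    # placeholder: in practice a separate LLM call that condenses
    # the older turns into a short running summary
    return "Summary of earlier conversation: ..."


def maybe_compact(messages):
    if count_tokens(messages) < COMPACT_THRESHOLD:
        return messages
    older, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    summary_msg = {"role": "system", "content": summarize(older)}
    return [summary_msg] + recent
```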
In my experience the context window by itself tells only half the story. Load a big 200k-token document and ask it a question, and it will answer just fine. But start a conversation that balloons past 100k tokens and it starts losing coherence pretty quickly. So I'd guess how the context arrives, one big chunk versus an accumulating conversation, matters more than the raw window size.
I'm oversimplifying here, but graph databases and knowledge graphs exist. An LLM doesn't need to preserve everything in context, just what it needs for that conversation.
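A toy version of the idea, with plain dicts standing in for a real graph database and the entities and relations made up:

```python
# Facts live outside the context window in a tiny graph, and only the
# neighborhood relevant to the current question gets pulled back into
# the prompt. Plain dicts here stand in for a real graph DB.

from collections import defaultdict

graph = defaultdict(list)  # node -> list of (relation, node)


def add_fact(subject, relation, obj):
    graph[subject].append((relation, obj))


def neighborhood(entity, depth=1):
    """Collect facts within `depth` hops of an entity."""
    facts, frontier = [], {entity}
    for _ in range(depth):
        next_frontier = set()
        for node in frontier:
            for relation, obj in graph[node]:
                facts.append(f"{node} {relation} {obj}")
                next_frontier.add(obj)
        frontier = next_frontier
    return facts


add_fact("Alice", "works_at", "Acme")
add_fact("Acme", "headquartered_in", "Berlin")

# Only these few lines go into the prompt, not the whole memory:
print(neighborhood("Alice", depth=2))
# ['Alice works_at Acme', 'Acme headquartered_in Berlin']
```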
Context will need to go in layers. Like when you tell someone what you do for a living, your first version will be very broad. But when they ask the right questions, you can dive into the details pretty quickly.
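In code terms that could mean keeping summaries at a few levels of detail and only expanding a level when the question touches it. Everything here (the topics, the naive keyword match) is just illustrative:

```python
# Layered context: the broad summary is always in the prompt, and finer
# layers only get expanded when the question mentions them. The keyword
# matching is deliberately naive.

memory = {
    "summary": "I'm a backend engineer at a payments company.",
    "topics": {
        "payments": "I mostly work on the settlement pipeline and ledger.",
        "oncall":   "I'm on a weekly on-call rotation for the API gateway.",
    },
    "details": {
        "payments": "Settlement runs nightly; the ledger is event-sourced in Postgres.",
    },
}


def build_context(question):
    parts = [memory["summary"]]                      # layer 1: always included
    for topic, blurb in memory["topics"].items():
        if topic in question.lower():                # layer 2: on topic match
            parts.append(blurb)
            detail = memory["details"].get(topic)    # layer 3: only if present
            if detail:
                parts.append(detail)
    return "\n".join(parts)


print(build_context("How do payments get settled?"))
```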
But you don't have to hold the entire memory in context. You just need good techniques for pulling in the parts of the context you actually need. This can be done via RAG, multi-agent architectures, and so on. It's not perfect, but it will get better over time.
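A bare-bones version of the RAG part of that, with word overlap standing in for real embeddings (swap in an embedding model for anything serious):

```python
# Score stored memory chunks against the query and only put the top few
# into the prompt. Bag-of-words overlap is a stand-in for embeddings.

def score(query, chunk):
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / (len(q) or 1)


def retrieve(query, chunks, k=2):
    ranked = sorted(chunks, key=lambda ch: score(query, ch), reverse=True)
    return ranked[:k]


memory_chunks = [
    "User prefers concise answers with code examples.",
    "Project uses Postgres 15 and runs on Kubernetes.",
    "User's name is Dana; timezone is UTC+2.",
]

prompt_context = retrieve("what database does the project use", memory_chunks)
print(prompt_context)
```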