How long? We already have that. Qwen3.6 has 35B/27B models that beat GPT-4o, and you can run them at home on a single GPU. DeepSeekV4 just came up with a new way to get super long context with a KV cache an order of magnitude smaller than before. It's already happening!
I've been experimenting with running a few models for local inference. Some of them get "stuck" in a loop, retrying the same thing endlessly, which is weird; others are really good. (A repetition penalty in the sampling settings usually helps with the loops; rough sketch below.) If they can ever handle about 400k tokens without going batcrap crazy, I'll be impressed. That number comes from my experience with Claude after the 1M-token increase, where roughly 400k seemed to be the sweet spot. Mostly I'd like them to read more of the codebase instead of just making assumptions.

I've also been building a custom harness, and I'm just about to start on its tool-building features. I already have a task-tracking system similar to what Beads does, but I didn't like some things about Beads, so I made my own (second sketch below). Since tasks live outside the model, the context window doesn't need to be super massive for task tracking.
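For the repeat loops, here's a minimal sketch of the sampling knobs that usually tame them, assuming Hugging Face transformers as the local runtime; the model name and values are just illustrative, not tuned:

```python
# Sampling settings that often curb repeat loops in local inference.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # illustrative; any local causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

inputs = tokenizer("Summarize this repo's build steps:", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    repetition_penalty=1.1,   # down-weights tokens the model already emitted
    no_repeat_ngram_size=4,   # hard-blocks exact 4-gram repeats
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```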
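And for the task tracker, a rough sketch of the shape of it (not my actual code, and not Beads' format; all the names here are illustrative):

```python
# Minimal file-backed task tracker: tasks persist outside the model's context.
import json
import uuid
from dataclasses import asdict, dataclass, field
from pathlib import Path

@dataclass
class Task:
    title: str
    status: str = "open"  # open | in_progress | done
    depends_on: list = field(default_factory=list)
    id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])

class TaskStore:
    def __init__(self, path="tasks.json"):
        self.path = Path(path)
        self.tasks = (
            {t["id"]: Task(**t) for t in json.loads(self.path.read_text())}
            if self.path.exists() else {}
        )

    def add(self, title, depends_on=()):
        t = Task(title=title, depends_on=list(depends_on))
        self.tasks[t.id] = t
        self._save()
        return t.id

    def ready(self):
        # Tasks whose dependencies are all done: what the agent picks up next.
        return [
            t for t in self.tasks.values()
            if t.status == "open"
            and all(self.tasks[d].status == "done" for d in t.depends_on if d in self.tasks)
        ]

    def _save(self):
        self.path.write_text(json.dumps([asdict(t) for t in self.tasks.values()], indent=2))
```

The point is the agent just queries `ready()` each turn instead of carrying the whole task list around in context.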