The "effective attention" framing nails what I keep noticing too. Sonnet's official context is huge in principle, but in a real coding session where the agent is reading 30+ files, running grep, processing test output, emitting diffs — somewhere around 60-80k effective tokens I can feel it start to "skim" earlier context rather than reason over it. The thing it forgot isn't out of window; it's just not weighted highly enough anymore.
The tool-call history collapse is a problem I'd pay real money to have solved cleanly. My crude manual version: keep the function calls but drop or summarize the responses for anything older than ~15 turns. Most of the "what was I doing" signal lives in the calls, not the outputs. Letting the model itself mark "I'm done with that thread, compress the responses" feels like the right abstraction, but I haven't seen anyone ship it well yet.
A per-model "compaction aggressiveness" knob in Forge could be interesting — the small-model effective-attention cliff might respond to earlier/heavier trimming.
Forge does have tiered compaction, and it's configurable! Defaults are currently probably a bit on the high side for catching effective attention, but that might be a part of the code that interests you the most.
src/forge/context/ - specifically TieredCompact in strategies.py. That's the furthest I took it. The tool-call collapse in particular has been useful in agentic coding, but I haven't formalized/generalized it yet. I think within forge it'll be a callable tool that will rely on the model knowing when to trigger it (as you said - "I'm done with the task, can collapse"). That's the part I need to abstract out of my bespoke implementation.
>The tool-call history collapse is a problem I'd pay real money to have solved cleanly.
It's general attention collapse and it happens everywhere once you start noticing it.
The simplest example, which even frontier models fail at, is something of the form `A and not B', which they keep insisting means `A and B' after the text gets pushed far enough back in the context.
The only solution, I think, that is even theoretically capable of fixing this is using a different form of attention. One which innately understands tree-like structures and binds tree nodes close together regardless of overall distance from the end of the stream.
Incidentally this is what I'm also working on at $job.