Hacker News

zambelli, yesterday at 11:33 PM

That's where frontier pulls ahead for sure, at least the big frontier models, though I haven't formalized those findings because... time.

Necessary disclaimer: forge isn't technically concerned with model quality, just the execution of tool calls. Now for the actual answer...

What I found to be the limiting factor with small models in the 14B range was "effective attention". Beyond a certain point, still well within their training context window size, I start to see degradation. I don't have hard numbers for it, but that's where an Opus and the like can just keep going for ages. I did come up with a tool call message history collapse that I might dogfood into forge one day (effectively clean up the message history intelligently so the model doesn't lose track as easily).

That being said, the coding eval suite for my agentic coding harness does have some refactor and feature-addition tasks (everything is done on an actual sandboxed repo), and the small models can knock out those tasks even while pushing the 50-60 tool call mark. But I wouldn't trust them to do more than one of those in the same session.


Replies

jonnyasmar, yesterday at 11:44 PM

The "effective attention" framing nails what I keep noticing too. Sonnet's official context is huge in principle, but in a real coding session where the agent is reading 30+ files, running grep, processing test output, emitting diffs — somewhere around 60-80k effective tokens I can feel it start to "skim" earlier context rather than reason over it. The thing it forgot isn't out of window; it's just not weighted highly enough anymore.

The tool-call history collapse is a problem I'd pay real money to have solved cleanly. My crude manual version: keep the function calls but drop or summarize the responses for anything older than ~15 turns. Most of the "what was I doing" signal lives in the calls, not the outputs. Letting the model itself mark "I'm done with that thread, compress the responses" feels like the right abstraction, but I haven't seen anyone ship it well yet.
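For illustration, a minimal sketch of that crude manual version (keep the calls, blank out responses older than N turns). It assumes an OpenAI-style message list where tool calls live on `assistant` messages and tool outputs arrive as `tool` messages, and it counts one assistant message as one turn. The function name `collapse_tool_history`, its parameters, and the placeholder string are hypothetical, not anything from forge or another library.

```python
from typing import Any, Dict, List

Message = Dict[str, Any]


def collapse_tool_history(
    messages: List[Message],
    keep_recent_turns: int = 15,
    placeholder: str = "[tool output elided]",
) -> List[Message]:
    """Return a copy of `messages` with old tool responses replaced by a placeholder.

    Assistant messages (which carry the tool calls) are never modified.
    """
    # Assumption: one assistant message counts as one "turn".
    assistant_idx = [i for i, m in enumerate(messages) if m.get("role") == "assistant"]

    if keep_recent_turns <= 0:
        cutoff = len(messages)  # nothing counts as recent: collapse every tool response
    elif len(assistant_idx) <= keep_recent_turns:
        cutoff = 0  # history is still short: collapse nothing
    else:
        cutoff = assistant_idx[-keep_recent_turns]  # index where the recent window starts

    collapsed = []
    for i, msg in enumerate(messages):
        msg = dict(msg)  # shallow copy so the caller's history is left untouched
        if i < cutoff and msg.get("role") == "tool":
            msg["content"] = placeholder
        collapsed.append(msg)
    return collapsed
```

Swapping the placeholder for a short summary of the response would give the "summarize" variant; the model-driven "I'm done with that thread" signal is not covered here.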

A per-model "compaction aggressiveness" knob in Forge could be interesting — the small-model effective-attention cliff might respond to earlier/heavier trimming.
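One way such a knob could look, as a rough sketch only: a per-model lookup with a default, where small models get a shorter "keep verbatim" window. `CompactionPolicy`, `policy_for`, the model name, and the numbers are all hypothetical placeholders; none of this is an existing Forge setting, and the values were not measured.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class CompactionPolicy:
    keep_recent_turns: int  # tool responses newer than this many turns stay verbatim


# Placeholder numbers for illustration only; nothing here was measured.
DEFAULT_POLICY = CompactionPolicy(keep_recent_turns=15)
PER_MODEL_POLICY: Dict[str, CompactionPolicy] = {
    "example-14b-local": CompactionPolicy(keep_recent_turns=5),  # hypothetical model name
}


def policy_for(model_name: str) -> CompactionPolicy:
    """Look up the compaction policy for a model, falling back to the default."""
    return PER_MODEL_POLICY.get(model_name, DEFAULT_POLICY)
```

The returned `keep_recent_turns` would feed whatever trimming routine the harness already uses, so tuning a small model means changing one table entry rather than the trimming logic.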
