Thanks for the details. What's a second-generation agent?
You mentioned the workflow is heavy on specs and tests. The smaller models seem to be really good at following instructions now. (Well, some of them!)
So that's probably part of why you're seeing good results. The model has a very clear target.
Whereas with more open-ended instructions, they seem to struggle more. I think common sense is the main thing you get with model size.
When I'm working with the big models I feel like I don't have to spell things out so much. The gap is closing, but I'm assuming there is some fundamental limit there based on the size.
Of course the ideal would be Mythos, running for free, in my house, at 1,000 tok/s ;) Someday...