ahmadyan's comparison is fair. Meta's CWM models hitting 65% vs SERA's 54% is a meaningful gap.
But the interesting number here isn't accuracy. It's the $400 to reproduce top open-source performance. That's the part that matters for teams building internal tooling.
We've been running agents on proprietary codebases at work. The pain isn't model quality. It's customization. Most off-the-shelf agents don't understand your repo structure, your conventions, your test patterns. If you can fine-tune a 32B model on your own codebase for a few hundred dollars, that changes the economics completely.
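To make the "changes the economics" claim concrete, here's a rough back-of-envelope sketch. Every number in it (GPU rental rate, training hours, API pricing, token volume) is an illustrative assumption I picked to land near the quoted $400 figure, not something from the paper:

```python
# Back-of-envelope: one-time fine-tune of a 32B model on rented GPUs vs.
# ongoing per-token API spend. All constants below are assumptions.

GPU_HOURLY_RATE = 2.50     # assumed $/hr for a rented H100-class GPU
GPUS = 8                   # assumed node size
TRAIN_HOURS = 20           # assumed wall-clock hours for a LoRA-style fine-tune
finetune_cost = GPU_HOURLY_RATE * GPUS * TRAIN_HOURS        # ~$400 one-time

API_COST_PER_MTOK = 10.0   # assumed blended $/1M tokens for a hosted frontier model
TOKENS_PER_TASK = 200_000  # assumed agentic tokens (context + retries) per issue
TASKS_PER_MONTH = 500      # assumed team-wide usage

monthly_api_cost = TASKS_PER_MONTH * TOKENS_PER_TASK / 1e6 * API_COST_PER_MTOK

print(f"one-time fine-tune: ~${finetune_cost:,.0f}")   # ~$400
print(f"monthly API spend:  ~${monthly_api_cost:,.0f}") # ~$1,000
# With these made-up numbers the fine-tune pays for itself in under a month --
# but only if you can also serve the model cheaply, which is its own problem.
```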
But codebases change every day, so the fine-tuning would have to be done continuously!
Probably not worth it versus something like Claude Code.
Curious whether anyone's tried this on non-Python codebases. Most SWE-Bench stuff is Python-heavy.
The fine-tuning overhead is definitely a factor, but for smaller shops the hard constraint is usually inference VRAM. Running a 32B model locally or on a rented GPU is surprisingly expensive if you aren't saturating it. Even at 4-bit quantization you are looking at dual 3090s or an A6000 to get decent tokens per second. The $400 training cost is impressive but the hosting bill is what actually kills the margin compared to per-token APIs.
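For anyone sanity-checking the VRAM claim, a quick back-of-envelope. The config numbers are assumptions for a typical 32B dense model (Qwen2.5-32B-ish layer counts and GQA setup), not specifics from the paper:

```python
# Rough serving-VRAM estimate for a 32B dense model at 4-bit quantization.
# Layer count, KV heads, and head dim are assumed, not from the SERA paper.

PARAMS = 32e9
WEIGHT_BYTES = 0.5   # 4-bit quantized weights
LAYERS = 64
KV_HEADS = 8         # grouped-query attention
HEAD_DIM = 128
KV_BYTES = 2         # fp16 KV cache
CONTEXT = 32_768     # agentic runs burn long contexts

weights_gb = PARAMS * WEIGHT_BYTES / 1e9
kv_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * KV_BYTES  # K and V, all layers
kv_cache_gb = kv_per_token * CONTEXT / 1e9

print(f"weights:  ~{weights_gb:.0f} GB")    # ~16 GB
print(f"KV cache: ~{kv_cache_gb:.1f} GB")   # ~8.6 GB at 32k context
# ~25 GB before activations and runtime overhead: marginal on a single 24 GB
# card, comfortable on dual 3090s (48 GB) or an A6000 (48 GB) -- which is
# roughly where the "dual 3090s or an A6000" estimate above comes from.
```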