Yes it does! I haven't published those evals yet, but I'm actually running 24-35B class models on a custom coding harness built on forge (even 120B class recently).
I just need more GPU wall clock time to get more evals done. ETA is...a few weeks? Got distracted by the coding harness.
But the results are the same. Reforged models do better than bare, even at those sizes. As for published results, I ran forge on Anthropic models and reforged doe better than bare for them as well :)
If it's worth it to you, you could try running it on Deepseek v4 flash which is very cheap right now...
Exactly what I was thinking - even on frontier or near-frontier models I still see my agents get stuck in these pointless loops where it's very obvious to me what they need to do to get "unstuck".
>But the results are the same. Reforged models do better than bare, even at those sizes
>I haven't published those evals yet
Don't forget to post the complete settings for those evals, please, because local LLMs' failure modes are often caused by incorrect setups (bad quants, bad chat templates, non-recommended temperatures, ridiculously small context, not enabling "preserve thinking" etc.). In my setup I've never seen Qwen3.6-27b get truly stuck so far. What it usually gets wrong are poor architectural decisions or forgetting to update something.