> I've seen this claimed, but I'm not sure it's been true for my use cases?
I'd be surprised if it isn't true for your use cases. If you give GLM-5.1 and Opus 4.6 the same coding task, they will both produce code that passes all the tests. In both cases the code will be crap, as no model I've seen produces good code. GLM-5.1 is actually slightly better at following instructions exactly than Opus 4.6 (though maybe not 4.7, as that's an area they addressed).
I've asked GLM-5.1 and Opus 4.6 to find a bug caused by a subtle race condition (the race condition leads to a number being 15172580 instead of 15172579 after about 3 months of CPU time). Both found it, in a similar amount of time. Several senior engineers had stared at the code for literally days and didn't find it.
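For anyone wondering what that class of bug looks like: the actual code isn't shown in this thread, but the classic shape is a non-atomic read-modify-write on a shared counter, where a rare interleaving leaves the count off by one, the same flavor of symptom as the off-by-one above. A minimal Go sketch (all names illustrative, not from the original code):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var counter int64 // shared state with no synchronization
	var wg sync.WaitGroup

	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Non-atomic read-modify-write: two goroutines can read
			// the same value before either writes, corrupting the count.
			counter++
		}()
	}
	wg.Wait()

	// Almost always prints 1000, but very occasionally not: the bad
	// interleaving is rare enough to hide for months of CPU time.
	fmt.Println(counter)
}
```

The usual fix is `atomic.AddInt64(&counter, 1)` or a mutex; the point is that the failing interleaving can be so rare that staring at the code never reproduces it, which is why this is a genuinely hard thing for humans to find by inspection.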
There is no doubt the models vary in performance at various tasks, but we are talking about the difference between Ferrari and Mercedes in F1. The differences are undeniable, but this isn't F1, where things take a year to change.

The performance of the models from Anthropic and OpenAI literally changes day by day, often not because of the model itself but because of the horsepower those companies choose to give it on the day, or because they've tweaked their own system prompts. You can find no end of posts here from people screaming in frustration that the thing that worked yesterday doesn't work today, or that they suddenly find themselves running out of tokens, or that their favoured tool is blocked. It's not at all obvious the gap between the open-source models and the proprietary ones is bigger than the day-to-day swings the proprietary companies inflict on us.
> In both cases the code will be crap, as no model I've seen produces good code.
I'm wondering if you have actually used Claude Code, because the results are not as catastrophic as you describe.