A bit skeptical about a 27B model comparable to opus...

amunozo • today at 2:15 PM • 7 replies

Replies

For at least a year now, it has been clear that data quality and fine-tuning are the main sources of improvement for medium-level models. Size != quality for specialized, narrow use cases such as coding.

It’s not a surprise that models are leapfrogging each other when the engineers are able to incorporate better code examples and reasoning traces, which in turn bring higher quality outputs.

rubiquity • today at 3:51 PM

You should try it out. I'm incredibly impressed with Qwen 3.5 27B for systems programming work. I use Opus and Sonnet at work and Qwen 3.x at home for fun, and I barely notice a difference, given that systems programming work currently needs careful guidance with any model. I don't try to one-shot landing pages or whatever.

Aurornis • today at 3:43 PM

You should be skeptical. Benchmark racing is the current metagame in open-weight LLMs.

Every release is accompanied by claims of being as good as Sonnet or Opus, but when I try them (even hosted full weights) they’re far from it.

Impressive for the size, though!

jjcm • today at 3:44 PM

Opus 4.5, mind you, but I’m not too surprised given how good 3.5 was and how good the qwopus fine-tune was. The model was shown to benefit heavily from further RL.

esafak • today at 3:12 PM

Some of these benchmarks are supposedly easy to game. Which ones should we pay attention to?

wesammikhail • today at 3:05 PM

You'd be surprised how good small models have gotten. Model size isn't all that matters.

cmrdporcupine • today at 3:48 PM

A small model can be made to be "comparable to Opus" in some narrow domains, and that's what they've done here.

But when actually employed to write code, they fall over as soon as they leave that specific domain.

Basically, they might have skill but lack wisdom. At this size, they certainly won't have anywhere close to the same contextual knowledge.

Still, these things could be useful in the context of more specialized tooling, in a harness that heavily prompts in the right direction, or as a subagent for a "wiser" larger model that directs all the planning and reviews the results.
