Hacker News

amunozo today at 2:15 PM, 7 replies

A bit skeptical about a 27B model comparable to opus...


Replies

originalvichy today at 3:13 PM

For at least a year now, it has been clear that data quality and fine-tuning are the main sources of improvement for medium-level models. Size != quality for specialized, narrow use cases such as coding.

It’s not a surprise that models are leapfrogging each other when the engineers are able to incorporate better code examples and reasoning traces, which in turn bring higher quality outputs.

rubiquity today at 3:51 PM

You should try it out. I'm incredibly impressed with Qwen 3.5 27B for systems programming work. I use Opus and Sonnet at work and Qwen 3.x at home for fun, and I barely notice a difference, given that systems programming currently needs careful guidance with any model. I don't try to one-shot landing pages or whatever.

Aurornis today at 3:43 PM

You should be skeptical. Benchmark racing is the current metagame in open-weight LLMs.

Every release is accompanied by claims of being as good as Sonnet or Opus, but when I try them (even hosted full weights) they’re far from it.

Impressive for the size, though!

jjcm today at 3:44 PM

Opus 4.5, mind you, but I’m not too surprised given how good 3.5 was and how good the qwopus fine-tune was. The model was shown to benefit heavily from further RL.

esafak today at 3:12 PM

Some of these benchmarks are supposedly easy to game. Which ones should we pay attention to?

wesammikhail today at 3:05 PM

You'd be surprised how good small models have gotten. Model size isn't all that matters.

cmrdporcupine today at 3:48 PM

A small model can be made to be "comparable to Opus" in some narrow domains, and that's what they've done here.

But when actually employed to write code, they fall over as soon as they leave that specific domain.

Basically, they might have skill but lack wisdom. At this size they certainly won't have anywhere close to the same contextual knowledge.

Still, these things could be useful in the context of more specialized tooling, in a harness that heavily prompts in the right direction, or as a subagent for a "wiser" larger model that directs all the planning and reviews the results.
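
To make the subagent idea concrete, here is a rough sketch of what such a harness could look like: a larger "planner" model splits the task into steps, a small "worker" model attempts each step, and the planner reviews each result and sends feedback until it approves or the retry budget runs out. Everything named here is hypothetical and for illustration only: the `Model` callable type, `run_with_subagent`, the `PLAN:` / `STEP:` / `REVIEW:` prompt prefixes, and the convention that an approving review starts with `OK`. No real model API is assumed; both models are just functions from prompt text to response text.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt string to a reply string.
Model = Callable[[str], str]


@dataclass
class StepResult:
    step: str
    output: str
    attempts: int
    approved: bool


def run_with_subagent(
    task: str, planner: Model, worker: Model, max_attempts: int = 2
) -> List[StepResult]:
    """Planner plans and reviews; the small worker model does each step."""
    # Assumed convention: the planner returns one step per non-empty line.
    plan = [s.strip() for s in planner(f"PLAN: {task}").splitlines() if s.strip()]

    results: List[StepResult] = []
    for step in plan:
        feedback = ""
        output = ""
        approved = False
        attempts = 0
        while attempts < max_attempts and not approved:
            attempts += 1
            # The worker sees the step plus any review feedback from the planner.
            output = worker(f"STEP: {step}\nFEEDBACK: {feedback}")
            verdict = planner(f"REVIEW: {step}\nOUTPUT: {output}")
            # Assumed convention: an approving review starts with "OK".
            approved = verdict.strip().upper().startswith("OK")
            feedback = verdict
        results.append(StepResult(step, output, attempts, approved))
    return results
```

The point of the split is that the small model never has to decide what to do next or judge its own output; unapproved steps are still returned with `approved=False` so a caller can escalate them to the larger model.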