VibeThinker: 3B param model that beats Opus 4.5 on reasoning with novel SFT+GRPO

164 points • by timhigins • today at 2:01 AM • 59 comments

Comments

There is some base level of intelligence any model needs to be useful, even in narrow tasks.

Could you teach a 5 year old to drive a car? A 10 year old? A 12 year old? Driving a car requires being able to read, to have judgement about icy or rainy conditions, to anticipate a child running after a ball. By the time a human is in their mid teens they have acquired the base knowledge...

Small models need to have enough base knowledge to be able to be good enough -- even in a seemingly narrow regime. Where is that? Obviously they don't need all the obscure knowledge of a frontier model but there is some base level which is probably more than it would first seem.

secretslol • today at 4:32 AM

Am I right in thinking this is a tiny model which has been trained well to reason, and that's it? Makes me think of a smart person who doesn't know anything about a given topic, but with the right tools will go and research the heck out of it. I really like the sound of this... why have models train on learning anything when you can just train them how to learn and let them get on with it from something as small as a Pi Zero and an internet connection?

gslepak • today at 3:54 AM

Note that these are Python-only results; the model will not do as well with other languages.

I'm glad to see more domain-focused SLMs, we need more of them! A programming focused MoE should work well across many languages.

NotSuspicious • today at 5:16 AM

The interesting thing about models this small is they should be able to be put on a single Taalas chip (the HC1 already runs a Llama 3.1 8B model). We're already at the point where half-decent reasoning could be run on an ASIC (and at mind-boggling speeds).

noperator • today at 3:32 AM

Having some success while testing this model out as a replacement for GPT-5 nano in source code security review. Running on RTX 3090 (24 GB VRAM) via vLLM. It's not great on structured output (as noted in the model card) but I'm working around that in my harness.
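
A minimal sketch of one way a harness might work around unreliable structured output. This is an illustration only, not the harness described above: the function name `extract_first_json_object` is hypothetical, and it assumes the model wraps a JSON object somewhere inside free-form reasoning text.

```python
import json


def extract_first_json_object(text):
    """Return the first parseable JSON object embedded in text, or None.

    Hypothetical helper: scans for each '{' and tries to decode a JSON
    value starting there, skipping braces that are not valid JSON.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at index `start`
            # and ignores any trailing text after it.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pass  # not valid JSON at this brace; try the next one
        else:
            if isinstance(obj, dict):
                return obj
        start = text.find("{", start + 1)
    return None


if __name__ == "__main__":
    reply = 'Thinking about the set {a, b}... Final answer: {"severity": "high", "line": 42}'
    print(extract_first_json_object(reply))
```

A harness could call this on every model reply and retry the request when it returns `None`.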

aero2146 • today at 3:11 AM

I tried generating the classic pelican SVG, but it failed horribly, just showing me a rectangle and a black circle...

SwellJoe • today at 4:12 AM

It's terrible at hunting security bugs (I expected it to be, but I wanted to be sure). I added it to a benchmark I made with a corpus of some Mythos-discovered bugs, and it found zero. The smallest pretty successful models remain Qwen 3.6 and Gemma 4 (but I haven't tested the very small variants of those yet).

https://swelljoe.com/post/will-it-mythos/

scotty79 • today at 8:10 AM

If you could pair it somehow with a model that can code and describe code this could be a very powerful combo.

zkmon • today at 6:23 AM

Does Python coding depend on political facts of the world?

It might appear not, but actually, the process of reasoning is not an isolated act. The right and wrong way of doing things is codified in social evolution that absorbed all facets of life. Why should you optimize a piece of code for performance? Why is performance needed? What is a bug? What features and UI themes would be more intuitive for humans?

There is a butterfly effect. Everything affects everything to some extent.

anonyfox • today at 6:56 AM

Wake me up when it does OCaml fine.
