Hacker News

SwellJoe today at 6:30 AM

Early on, I had a vague suspicion that the reason some of the Chinese models, including quite small ones, perform so well on this task, especially relative to their size and cost, is that they don't have the same safety guardrails baked in regarding software security that US models seem to have. Gemini 3.1 Pro doing so poorly sort of reinforced that gut feeling.

But, then Gemma 4 proved to be extraordinarily good for its size (better than Qwen), and kinda disproved that US models are any weaker at small sizes. I haven't published the replication results for Gemma 4 yet, where I gave it multiple opportunities, but the dense version consistently found four of the nine bugs exactly, plus two other very difficult bugs that it found occasionally, sometimes with a not-quite-accurate description (which gets partial credit in its own column on the big benchmark), for six altogether. That leaves three bugs in the corpus that no model other than Mythos ever found, but it also makes Gemma 4 31B the best model I have results for (though it got multiple attempts, which I assume would make any of the models perform better).

So, my conclusion, not very strongly held, is that Mythos is both better than other public models and has fewer guardrails, but also that the guardrails in current models are probably not strict enough to prevent this work. Only Gemini models, when run under Antigravity, refused to perform the work. Maybe Mistral silently refused due to guardrails; I'm not sure, since it failed to find any bugs. Maybe it just sucks.


Replies

scorpioxy today at 7:40 AM

Can you elaborate on the software security guardrails that US models seem to have? According to blog posts I read, the generated code had security problems, and naive ones at that. Perhaps it has gotten better now, or people have learned not to blindly vibe code applications that are to be used publicly, but it certainly didn't feel like there were security guardrails.

coldtea today at 7:51 AM

>But, then Gemma 4 proved to be extraordinarily good for its size (better than Qwen), and kinda disproved that US models are any weaker at small sizes.

Did it "disprove" it retroactively, or just change what the situation is, given that until then they were indeed weaker at small sizes?
