As I posted in another comment, I found Fable to be substantially more powerful than any previous model. However, this isn't just an ungrounded opinion - I uploaded my full session transcript and the code I created while working on a very complex implementation, so people can judge for themselves if they're interested: https://tossrock.substack.com/p/36-hours-with-fable
Around February, Opus 4.6 was excellent. Smart, fast, proactive. Then it got lobotomized and it's never been the same after that nerf. 4.7 came along and it too was disappointing—not unlike 4.8, which despite feeling a smidge smarter, tends to write word salad and is basically unusable for some workflows.
Fable felt like having access to that "old Opus" again, but a little smarter. Sort of like I'd expect an Opus 5 to be. It's not earth shattering, but it was a step in the right direction. And it was distinctively so, because having to go back to Opus 4.6/4.7/4.8 has been borderline depressing...
It understood more with less help, did more per turn, and was less argumentative. It also felt a little less trite in its answers, which is an understated improvement for those who use Claude Code all the time.
> And, all of the bugs can be identified by several models if they are pointed directly at it and told what to look for.
This made me think, well, sure, if you tell them what to look for... but then:
> The models can look at the whole repo, and follow logic across file boundaries, but they’re not told what to look for.
So okay, the first one was an accidental misstatement?
From all the things I've read, I'm pretty convinced that Mythos is just a standard LLM with safety features turned off. If current models weren't reluctant to search for vulnerabilities, they might perform as well as Mythos.
For malware detection, many models are biased for or against detecting a threat (likely a thing that can be adjusted with a prompt).
I suggest using tasks whose answers cannot be guessed (find, not tell), and 2D charts for both ROC and pricing; see https://quesma.com/benchmarks/binaryaudit/
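To make the bias point concrete, here is a minimal sketch of how one detection threshold maps to one point on a ROC chart. It is my own illustration, not taken from the linked benchmark: the `roc_points` helper is a hypothetical name, and the scores and labels are made up.

```python
# Hypothetical example: the scores and labels below are invented for
# illustration. They are not results from any real model or benchmark.

def roc_points(scores, labels, thresholds):
    """Return a (threshold, fpr, tpr) tuple for each threshold.

    scores: model confidence that a sample is malicious (0..1)
    labels: 1 = actually malicious, 0 = benign
    A sample is flagged as a threat when score >= threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        # True positives: malicious samples that were flagged.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        # False positives: benign samples that were flagged.
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((t, fp / negatives, tp / positives))
    return points


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for t, fpr, tpr in roc_points(scores, labels, [0.25, 0.5, 0.75]):
    print(f"threshold={t:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

A model that is "biased toward detecting" behaves like the low threshold here (catches everything, more false alarms), and one "biased against" behaves like the high threshold. Comparing models at a single operating point hides that trade-off, which is why a 2D chart is more informative.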
Fable was able to one-shot pretty big features. In a write spec -> refine spec -> create todos -> implement todos workflow, the difference was far less pronounced vs Codex or Opus.
In my brief experience, the difference between Fable and Opus is largely in persistence, not global intelligence like you might expect. Fable just... goes the extra mile, sometimes in a scary way.
I thought the whole point was that it doesn’t need to be pointed at the problem. That’s a much easier problem to solve. Also you eliminate 10000 false positives.
What makes Mythos special is the fact that someone with zero expertise in the field could find and weaponize a zero-day. Real threat actors already use LLMs en masse, and the recent advancements with glm-5.2 will probably enable way more cyber attacks than Fable ever could.
Spatial reasoning is where Fable really separates itself imo.
Surprise... someone downplaying Mythos/Fable who didn't actually use it. Plenty of comments here say the contrary, including my own: in my personal experience, Fable was easily a step change in capability over Opus, figuring out things when reverse engineering binaries that Opus plain couldn't find.
Frankly, after testing out Fable last week, it was just a bigger sink of tokens than anything else. The amount of tokens it consumed wasn't worth the steps it saved me compared to using Opus 4.8.
This just shows that Google needs to double down on its AI models fast. Even open source Chinese models are beating 3.1 Pro and 3.5 Flash in almost everything.
Is the title a reference to "will it blend"?
The benchmark fills an interesting niche, but the methods need work considering how many caveats are included in the results.
Gemini / Antigravity didn't use to be this hamstrung. Something changed within the past couple of months that makes security work very difficult. Even auditing or securing your own code now requires an insane amount of prompt engineering that is utterly ridiculous and did not use to be required.
Yesterday I wanted to delete records from a database on my own SSH server. It refused to do so, no matter what I prompted. Very annoying.
Truth is stranger than fiction.
Opus 4 class models are terrifying at infosec. They tie their shoelaces together on other things, but don't fuck with them on that. It's a savant thing.
A cursory reading of the model card shows Mythos/Fable is a fine tune on Project Zero with some steering on persistence.
But I think it's a valuable lesson: advertise your product as a nuclear weapon while microdosing at Lighthaven to enough Davos attendees and sooner or later? Someone is going to evaluate the claim from a chair where you act first and nuance later.
Wild that Amodei's blog and pod circuit are the greatest IPO risk.
I don't understand the article.
> I’d say this benchmark answers with a resounding, “Maybe.” Mythos maybe really is better than the other current models at finding security bugs
Yet in the results, I don't see Mythos?
It seems like a really well-researched article with lots of results for other models, yet the title seems to be clickbait because the results don't contain Mythos, do they?
Fable was the only model that was able to detect a data corruption bug in my Qt C++ note-taking app[1] that all other tested models (gpt-5.5 xhigh, GLM-5.1, Kimi 2.7, DeepSeek V4 Pro) didn't find. I'll test on GLM-5.2 and Mimo v2.5 Pro soon.
[1] https://www.get-notes.com