> It's possible Opus or GPT-5.5 could have done this too, I've not tried the exact same...

teiferer • today at 6:06 AM • 18 replies

> It's possible Opus or GPT-5.5 could have done this too, I've not tried the exact same sequence. The Fable vibes are good here, though.

And that's the thing. These comparisons are all gut feelings. What I'm missing are objective, unbiased measurements that would allow real comparisons between different models, between their generations, or even of the convention that everybody adds "you are an expert software engineer" and "don't make mistakes" to their prompts because they think it improves something. Nobody knows if it actually does.

Replies

zylepe • today at 11:39 AM

Vibes are all that matter. As soon as you start measuring something, that measurement becomes a target and vendors start optimizing for it at the expense of the general usefulness of the model. We’ve seen plenty of models with great benchmark scores flop when people start using them.

andai • today at 6:28 PM

I added "you can do anything if you believe" to my agent and it went from not even attempting things to just doing them effortlessly.

I know how stupid that sounds but it's true.

Well what do they say... "If it sounds stupid but it works, then it's not stupid!"

Wowfunhappy • today at 11:26 AM

Lots of things in life are gut feelings. It would be really great if we could determine quantitatively forever whether Rust is a superior programming language to Go, but real life resists those kinds of measurements.

johnisgood • today at 7:28 AM

Yes, these are gut feelings. That said, I have a lot of experience with Opus, and I have lots of projects and contributions (all reviewed and tested) made with its help. Definitely useful, to me and to people whose projects matter to them. :P

Adding "do not make mistakes" is silly, in my opinion. There is always a good chance it will make mistakes. You should be specific about what you want rather than as broad as "do not make mistakes". It just does not work that way.

Certhas • today at 6:48 AM

There are tons of benchmarks in the announcement. But we also know that benchmarks are problematic.

So the best we can do right now seems to be to combine imperfect case studies like this with imperfect benchmarks to get some unreliable impression of where we are...

hardwaregeek • today at 10:23 AM

Ok but isn’t that true of all software development? It’s not like anybody’s done a rigorous test of writing their entire codebase in Python vs Java. It’s all vibes based there. People create post-hoc justifications for why they use certain technologies but the reality is a lot more vibes than anything else.

bfrog • today at 1:09 PM

How do you measure the performance of people? This is subjective and biased every time.

tezza • today at 6:54 AM

It is possible to check for improvements. See for yourself:

https://generative-ai.review/2026/06/claude-fable-rush-test-...

As mentioned in another HN thread, I've done qualitative side-by-side comparisons of Claude Fable vs Opus 4.8 vs ChatGPT 5.5.

Anyone is able to check the output for themselves and form a judgement.

Large visible improvements for Fable over Opus 4.8 and ChatGPT 5.5.

I recently did the same to show the progress from Opus 3.4/ChatGPT o3pro one calendar year ago.

contextfree • today at 6:57 AM

fwiw, I gave it the same vibecoding project I'd previously tried with Sonnet 4.5 and it took Fable 2 hours to go well beyond (like, 2x beyond) where I got in 8 hours with Sonnet 4.5. (beyond that idk, because past 8 hours with the Sonnet 4.5 version I hit the "vibe limit" where it becomes easier to just write/edit the code yourself than get the agent to do what you want; and past 2 hours with Fable I hit my usage limit.)

ElFitz • today at 7:37 AM

That’s what evals are for.

And there’s no reason evals can’t be done on multi-turn agents in a loop (or not): it’s pretty much what all these benchmarks do.
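For illustration, a minimal sketch of what such an eval loop might look like. Everything here is hypothetical: `Task`, `run_task`, `pass_rate`, and `stub_agent` are made-up names, and the stub stands in for a real model call. It only shows the shape of the idea: objective checks, a retry loop, and a pass rate per prompt variant.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: an "agent" maps (system_prompt, history) to its next message.
Agent = Callable[[str, List[str]], str]


@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]  # objective pass/fail on a single answer


def run_task(agent: Agent, system_prompt: str, task: Task, max_turns: int = 3) -> bool:
    """Run the agent in a loop, feeding failures back, until it passes or runs out of turns."""
    history = [task.prompt]
    for _ in range(max_turns):
        answer = agent(system_prompt, history)
        if task.check(answer):
            return True
        history.append(answer)
        history.append("That answer failed the check; try again.")
    return False


def pass_rate(agent: Agent, system_prompt: str, tasks: List[Task], max_turns: int = 3) -> float:
    """Fraction of tasks the agent solves within max_turns."""
    results = [run_task(agent, system_prompt, t, max_turns) for t in tasks]
    return sum(results) / len(results)


def stub_agent(system_prompt: str, history: List[str]) -> str:
    """Stand-in for a real model: ignores the system prompt, succeeds on its second attempt."""
    attempts = (len(history) - 1) // 2  # each failed turn adds two history entries
    a, b = (int(x) for x in history[0].split("+"))
    return str(a + b) if attempts >= 1 else "unsure"


# Toy tasks with machine-checkable answers (assumed for the example).
TASKS = [
    Task("2+3", lambda s: s == "5"),
    Task("10+4", lambda s: s == "14"),
]

# Two prompt variants to compare on the same tasks.
PLAIN = ""
ROLE = "You are an expert software engineer. Don't make mistakes."
```

Calling `pass_rate(stub_agent, PLAIN, TASKS)` and `pass_rate(stub_agent, ROLE, TASKS)` gives one number per variant. With a real model behind the agent you would want many tasks and repeated runs before reading anything into the difference.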

torginus • today at 11:17 AM

Yeah, if the jump is big, then we should be able to see the qualitative improvements, or see where Opus was tripped up in a task and Fable succeeded.

vonneumannstan • today at 5:40 PM

The first thing in the release page is benchmark results...

https://www.anthropic.com/news/claude-fable-5-mythos-5

lqstuart • today at 12:54 PM

It’s almost like they’re interchangeable. We need to start asking these models to solve extremely difficult, contrived DSA coding questions before deciding which ones we employ

kmacdough • today at 9:47 AM

I believe there is hard evidence that role-playing prompts are effective at leading it towards particular strategies and trains of thought. Not sure that SWE has been specifically studied, but proper science is very slow in the context of rapid change and broad context. It's good to stay grounded in the science that has been done, but we're going to have to do our best in uncharted territory for a while.

"Don't make mistakes" does seem dumb. It's not guidance.

solumunus • today at 1:39 PM

Just treat it like an employee with infinite energy. You can never really measure the productivity or ability of employees, it’s just pretty obvious when one is better than another. You’re asking them to do things and they’re either coming up with the goods or they aren’t. You can’t really expect much more from agents either but I’m not sure why you need anything more.

alecco • today at 11:57 AM

> These comparisons are all gut feelings.

https://simonwillison.net/about/#disclosures

"I have not accepted payments from LLM vendors, but I am frequently invited to preview new LLM products and features from organizations that include OpenAI, Anthropic, Gemini and Mistral, often under NDA or subject to an embargo. This often also includes free API credits and invitations to events."

But I'm totally unbiased on my gut-feeling posts, trust me bro.

-- AI influencers.
