tekacs • today at 6:27 AM • 6 replies

I'm pretty baffled by their choice of axes. I would have thought that the left was the cheapest, not the most expensive. I appreciate that this layout means that top right can be best, but it's still unintuitive to have this backwards cost axis IMO.

Putting that aside, I spend all day every day implementing very, very hard things right on the edge of what agents are (barely, sometimes) capable of, and I have had to keep Opus on max for things that need 'real validation' for a while now. And that has felt like 'the only way' to get Opus to perform even close to 5.5 xhigh. I'm only using Opus at all because GPT-5.5 in the subscriptions only has a small (400k, but 258k effective) context window.

The difference is that 5.5 xhigh is extremely fast in most practical cases, both efficiently implementing _overall_, and responding very quickly with great adaptive thinking if you ask it something that it doesn't have to think about. Opus 4.8 Max will needlessly chew on everything and can take hours to implement even simple things, so I can mostly only use it for planning/review.

Fable is much much better at adaptive thinking / responding quickly (although probably still worse than 5.5 xhigh), and... I think folks have said enough elsewhere about its strengths and weaknesses. Sadly still not a reliable implementor for my hard tasks though (that's still GPT's domain) – it tends to leave big, dangerous holes hiding inside implementations unless babied.

Replies

andai • today at 1:38 PM

>it tends to leave big, dangerous holes hiding inside implementations unless babied.

A brainwave: perhaps GLM or DeepSeek could be integrated into the mix for the purposes of red-teaming the code. Fable has been blinded to security by design[0], and the open models are pretty decent at it.

[0] It's not clear what the situation with GPT-5.6 will be but the blog suggests similarly over-cautious safety filters.

Amusingly the posts for recent Opus releases brag that they successfully made it worse at security! "during its [Opus 4.7] training we experimented with efforts to differentially reduce these ["cyber"] capabilities"

budsniffer952 • today at 9:53 AM

>Putting that aside, I spend all day every day implementing very, very hard things right on the edge of what agents are (barely, sometimes) capable of

Is a single thing in your post demonstrable, or are we just supposed to take your word for it? Because all of this stuff sounds laughably subjective.

mklarmann • today at 8:21 AM

It’s Gartner. Top-right is where you want to be.

pbowyer • today at 6:45 AM

> I'm only using Opus at all because GPT-5.5 in the subscriptions only has a small (400k, but 258k effective) context window.

Do you find that makes a difference in your work? I've been using 5.5 high/xhigh to optimize and benchmark a C codebase, and just reading the initial code virtually fills the first context window. A session will auto-compact 5-15 times, but it seems to do okay in spite of that because the task is mainly focused on the latest window each time.

I think that for programming, GPT's strength over Opus outweighs its smaller context window.

cherryteastain • today at 8:00 AM

You can set GPT 5.5 to 1M context mode in Cursor but it costs more after the default 272k.

0123456789ABCDE • today at 8:11 AM

opus@max is on average worse than opus@xhigh

for supporting evidence, see first chart here: https://www.anthropic.com/news/claude-fable-5-mythos-5
