Hacker News

vessenes · yesterday at 3:46 PM · 8 replies

Looks like solid incremental improvements. The UI oneshot demos are a big improvement over 4.6. Open models continue to lag roughly a year behind on benchmarks; pretty exciting over the long term. As always, GLM is really big - 355B parameters with 31B active - so it's a tough one to self-host. It's a good candidate for a Cerebras endpoint in my mind: getting Sonnet 4.x (x < 5) quality with ultra-low latency seems appealing.


Replies

pseudony · yesterday at 6:58 PM

I hear this said, but never substantiated. Indeed, I think our big issue right now is making actual benchmarks relevant to our own workloads.

Due to US foreign policy, I quit Claude yesterday and picked up MiniMax M2.1. We wrote a whole design spec for a project I'd previously written a spec for with Claude (with some changes to the architecture this time; adjacent, not the same).

My gut feel? I prefer MiniMax M2.1 with opencode to Claude. Easiest boycott ever.

(I even picked the $10 plan; it was fine for now.)

HumanOstrich · yesterday at 4:36 PM

I tried Cerebras with GLM-4.7 (not Flash) yesterday using paid API credits ($10). They have per-minute rate limits, and cached tokens count against them, so you get rate-limited within the first few seconds of every minute and then have to wait out the rest of it. So they're "fast" at 1000 tok/sec - but not for practical usage. Between the rate limits and the penalty for cached tokens, you effectively get <50 tok/sec.
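To make that math concrete, here's a back-of-the-envelope sketch of how a per-minute token budget turns a 1000 tok/sec endpoint into <50 tok/sec effective. The budget figure below is a made-up placeholder for illustration, not Cerebras's actual limit:

    # Back-of-the-envelope: why a 1000 tok/sec endpoint can feel like <50 tok/sec
    # under a per-minute token budget. MINUTE_BUDGET is a made-up illustration,
    # not Cerebras's published limit.
    RAW_SPEED = 1000          # advertised tokens per second
    MINUTE_BUDGET = 2500      # hypothetical tokens allowed per minute (cached + fresh)

    burst_seconds = MINUTE_BUDGET / RAW_SPEED      # time until the cap is hit
    effective_tps = MINUTE_BUDGET / 60             # averaged over the whole minute

    print(f"cap hit after {burst_seconds:.1f}s, idle for {60 - burst_seconds:.1f}s")
    print(f"effective throughput: ~{effective_tps:.0f} tok/sec")

Under those assumptions you burn the whole budget in ~2.5 seconds and then sit idle, averaging out to ~42 tok/sec over the minute.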

They also charge full price for the same cached tokens on every request/response, so I burned through $4 on one relatively simple coding task - it would've cost <$0.50 with GPT-5.2-Codex or any other model that supports caching (besides Opus and maybe Sonnet). And it would've been much faster.
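For a sense of scale, here's a rough sketch of how billing the whole context at full price every turn compares to a cached-read discount in an agentic loop. The prices, discount, and context sizes are illustrative placeholders, not any provider's real numbers:

    # Rough illustration: an agentic coding loop re-sends the growing context every
    # turn. Compare billing all of it at full price vs. billing the cached prefix
    # at a discount. All numbers here are hypothetical placeholders.
    PRICE_PER_MTOK = 2.00     # $ per 1M input tokens, full rate (made up)
    CACHED_RATE = 0.10        # cached reads at 10% of full price (made up)

    turns, context, new_per_turn = 30, 20_000, 3_000
    no_cache = with_cache = 0.0
    cached_prefix = 0
    for _ in range(turns):
        no_cache += context * PRICE_PER_MTOK / 1e6
        fresh = context - cached_prefix
        with_cache += (cached_prefix * CACHED_RATE + fresh) * PRICE_PER_MTOK / 1e6
        cached_prefix = context        # everything sent so far is now cached
        context += new_per_turn        # next turn's prompt is bigger

    print(f"full price every turn:    ${no_cache:.2f}")    # ~ $3.8
    print(f"with cached-read billing: ${with_cache:.2f}")  # ~ $0.6

Even with these modest made-up numbers, re-billing the cached prefix at full price is roughly the difference between a few dollars and well under a dollar for one task.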

show 6 replies
Workaccount2 · yesterday at 5:14 PM

Unless one of the open model labs has a breakthrough, they will always lag. Their main trick is distilling the SOTA models.

People talk about these models like they're "catching up"; they don't see that they're just trailers hitched to a truck that's pulling them along.

show 3 replies
behnamoh · yesterday at 5:05 PM

> The UI oneshot demos are a big improvement over 4.6.

This is a terrible "test" of model quality. All these models fail when your UI is out of distribution; Codex gets close but still fails.

mckirk · yesterday at 4:03 PM

Note that this is the Flash variant, which is only 31B parameters in total.

And yet, in terms of coding performance (at least as measured by SWE-Bench Verified), it seems to be roughly on par with o3/GPT-5 mini, which would be pretty impressive for something you can realistically run at home, if it translates to real-world usage.
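For a rough sense of what "run at home" means, here's a weight-memory estimate using the parameter counts quoted in this thread; the bits-per-weight values are approximate, KV cache and activation overhead are ignored, and note that for the MoE full model all 355B weights still need to be resident even though only 31B are active per token:

    # Very rough weight-memory estimate. Parameter counts come from this thread;
    # bits-per-weight values are approximate, and KV cache / activations are ignored.
    def weight_gb(params_billion: float, bits_per_weight: float) -> float:
        return params_billion * 1e9 * bits_per_weight / 8 / 1e9

    for name, params_b in [("GLM full (355B total, 31B active)", 355),
                           ("GLM Flash (31B total)", 31)]:
        for label, bits in [("fp16", 16), ("int8", 8), ("~4-bit", 4.5)]:
            print(f"{name:34s} {label:6s} ~{weight_gb(params_b, bits):5.0f} GB")

Under those assumptions, the Flash variant's weights at ~4-bit come to roughly 17-18 GB, i.e. a single 24 GB consumer GPU before KV cache, while the full 355B model still needs on the order of 200 GB even quantized - which is why it's a tough one to self-host.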

show 1 reply
ttoinou · yesterday at 5:11 PM

Sonnet was already very good a year ago; are open-weights models now as good?

show 2 replies