
gandreani • today at 6:10 PM • 11 replies

Easily the most interesting part of this announcement is buried in the second to last paragraph:

"We're also launching GPT‑5.6 Sol on Cerebras at up to 750 tokens per second in July, bringing frontier intelligence to customers at unprecedented speed. Access will initially be limited to select customers as we expand capacity."

750 tokens/s on a frontier model is going to be extremely interesting. I doubt this new version is anything more than a version bump in terms of capabilities, but if we can start getting these answers back faster, they end up being more useful.

Just off the top of my head, I can think of the tedious task of finding certain functionality within a codebase. I usually can't beat an AI agent harness at this task today. If the AI model is 3x faster, I have even less of a chance.

Replies

donquichotte • today at 7:38 PM

> I can think of the tedious task of finding certain functionality within a codebase. I usually can't beat an AI agent harness at this task today.

Yup, I remember "racing" the AIs to figure things out in codebases just a year ago. Today, I have no chance. Whether it is due to degraded reasoning capabilities on my part or better models, I don't know.


sberens • today at 6:18 PM

For comparison, openrouter says opus 4.8 is ~55 tokens/s and fast mode is ~102.

750 tokens/s for their largest model is going to be nuts
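To put those rates side by side, here is a back-of-the-envelope sketch. The 2,000-token response length is an assumption picked purely for illustration, and the rates are the ones quoted in this thread, not measurements. It also ignores time to first token and any queueing.

```python
def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream num_tokens at a steady decode rate (ignores latency to first token)."""
    return num_tokens / tokens_per_second


# Rates as quoted in the thread; "up to 750" is the advertised ceiling, not a measured figure.
RATES = {
    "opus 4.8 (standard)": 55,
    "opus 4.8 (fast mode)": 102,
    "GPT-5.6 Sol on Cerebras (up to)": 750,
}

# Assumed response length, chosen only for illustration.
RESPONSE_TOKENS = 2000

for name, rate in RATES.items():
    print(f"{name}: {seconds_to_generate(RESPONSE_TOKENS, rate):.1f} s")
```

Under those assumptions the same response takes roughly 36 s, 20 s, and under 3 s respectively.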


bob1029 • today at 8:43 PM

At a certain rate we will be able to move towards continuous / real-time inference systems. The discrete, turn based solutions are quite confining with how they must be trained. Continuous and real-time would fundamentally alter the domain.

From an information theory perspective we are still in dial-up territory with regard to the actual information rate. 750 tokens per second would be a really bad dial-up connection. Imagine 10 million tokens per second.
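A rough way to check the dial-up comparison: if every token were drawn uniformly from the vocabulary, each one would carry at most log2(vocab size) bits. The sketch below assumes a 200,000-entry vocabulary (a hypothetical round number, not a figure for any specific model) and a 56 kbit/s modem. Real token streams are far from uniform, so this is an upper bound on the information rate.

```python
import math


def max_bits_per_second(tokens_per_second: float, vocab_size: int) -> float:
    """Upper bound on information rate: each token carries at most log2(vocab_size) bits."""
    return tokens_per_second * math.log2(vocab_size)


VOCAB_SIZE = 200_000   # assumed vocabulary size, for illustration only
DIALUP_BPS = 56_000    # nominal rate of a 56k modem

rate_750 = max_bits_per_second(750, VOCAB_SIZE)
rate_10m = max_bits_per_second(10_000_000, VOCAB_SIZE)

print(f"750 tokens/s  -> at most {rate_750 / 1e3:.1f} kbit/s")
print(f"10M tokens/s  -> at most {rate_10m / 1e6:.0f} Mbit/s")
print(f"fraction of a 56k modem at 750 tokens/s: {rate_750 / DIALUP_BPS:.2f}")
```

With those assumptions, 750 tokens/s tops out at about 13 kbit/s, roughly a quarter of a 56k modem, while 10 million tokens/s would land around 176 Mbit/s.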


eli • today at 7:04 PM

I'm skeptical about how fast "up to" 750t/s really is. Maybe if they make it extremely expensive so it frees up enough capacity?

GPT‑5.3‑Codex‑Spark currently runs on Cerebras chips and it's giving me around 150t/s. Still relatively very fast, but nowhere near the 1,000t/s they claimed at launch. (Also it's not a very good model.)

That said, I'm super bought in to faster models being better for most use cases than smarter models.


tontinton • today at 6:19 PM

Yep, this is a glimpse into the future of 500+ t/s, which is, in my opinion, the next big thing that validates the Jevons paradox (the models are already smart enough).


motoboi • today at 7:41 PM

Bear in mind that "GPT‑5.6 Sol on Cerebras at up to 750 tokens per second" does not necessarily mean the same model (in terms of inference results). It could mean anything: a heavily quantized model, a different level of model activation per inference, etc.

Of course, we can trust that they wouldn't give the same name to things with different levels of intelligence, right? Right?


swalsh • today at 8:17 PM

This would be amazing for some of our "real-time" workflows that need to fall back to AI for one reason or another. What used to happen is that a rules-based system did the majority of the work, and the occasional corner case would fall back to humans. Then we moved AI in: still not real time, but much faster. Cerebras could make that even faster.

helloplanets • today at 6:18 PM

OpenAI also announced two days ago that they're starting to make Cerebras-style chips themselves [0]. It will be interesting to see how fast SotA model inference will be by the end of the year.

[0]: https://openai.com/index/openai-broadcom-jalapeno-inference-...


lostmsu • today at 7:54 PM

Does the Cerebras variant offer input caching and corresponding discounts? Last I checked, Cerebras either would not cache, or would cache but not give discounts for the cached input, making it impractical for agentic use and multi-turn conversations.

cruffle_duffle • today at 6:45 PM

"we can start getting these answers back faster, they end up being more useful."

Dude, 10x token speed is going to be absolutely nuts. Half the "parallel subagent workflow" business seems to be driven simply as a means to avoid tapping your thumbs waiting for the infernal robot to finish something. If things come back speedy quick all the time, it should keep up with the "speed of the human" and let me stay focused on one thread instead of half a dozen. Plus the cost of screwing up gets significantly lower because you just re-fire with an adjusted prompt and iterate.

Someday these things will be 100x as fast as they are today and that is when things will get insane.


ai_fry_ur_brain • today at 9:15 PM

From what I know about batch processing and concurrency in inference, this is a pipe dream... Or it's going to cost an arm and a leg. I think they're lying, or it's going to be a much smaller model and not "frontier".
