Co-founder / Chief Scientist at Inception here. If helpful, I’m happy to answer technical questions about Mercury 2 or diffusion LMs more broadly.
The iteration speed advantage is real but context-specific. For agentic workloads where you're running loops over structured data -- say, validating outputs or exploring a dataset across many small calls -- the latency difference between a 50 tok/s model and a 1000+ tok/s one compounds fast. What would take 10 minutes wall-clock becomes under a minute, which changes how you prototype.
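A back-of-the-envelope sketch of that compounding (all figures here are illustrative assumptions, not benchmarks):

```python
# Rough wall-clock estimate for an agent loop making many small
# sequential model calls. Token counts, throughputs, and the per-call
# overhead are made-up illustrative numbers.

def loop_seconds(n_calls: int, tokens_per_call: int, tok_per_s: float,
                 overhead_s: float = 0.2) -> float:
    """Total wall-clock: generation time plus fixed per-call overhead."""
    return n_calls * (tokens_per_call / tok_per_s + overhead_s)

slow = loop_seconds(100, 300, 50)     # ~50 tok/s model
fast = loop_seconds(100, 300, 1000)   # ~1000+ tok/s model

print(f"slow: {slow / 60:.1f} min, fast: {fast / 60:.1f} min")
```

With these assumptions the same 100-call loop drops from roughly ten minutes to under one, which is the regime where prototyping habits actually change.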
The open question for me is whether the quality ceiling is high enough for cases where the bottleneck is actually reasoning, not iteration speed. volodia's framing of it as a "fast agent" model (comparable tier to Haiku 4.5) is honest -- for the tasks that fit that tier, the 5x speed advantage is genuinely interesting.
Does that mean if it were embedded on a Taalas chip, it could generate ~50,000+ tokens per second?
I'm not sold on diffusion models.
Other labs like Google have them too, but they've simply trailed the Pareto frontier for the vast majority of use cases.
Here's more detail on how price/performance stacks up
What excites me most about these new 4figure/second token models is that you can essentially do multi-shot prompting (+ nudging) and the user doesn't even feel it, potentially fixing some of the weird hallucinatory/non-deterministic behavior we sometimes end up with.
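One way to picture that pattern -- `generate` and `looks_consistent` below are hypothetical stand-ins for a real model call and a cheap self-check, not anyone's actual API:

```python
# Sketch of multi-shot prompting with a "nudge" pass. At ~1000 tok/s,
# a few sequential calls still finish in well under a second, so the
# user only ever sees the final answer.
# `generate` and `looks_consistent` are hypothetical placeholders.

def generate(prompt: str) -> str:
    # Stand-in for a real (fast) model call.
    return "DRAFT: " + prompt

def looks_consistent(answer: str) -> bool:
    # Stand-in for a cheap self-check (format check, citation check, etc.).
    return answer.startswith("DRAFT:")

def answer_with_nudges(question: str, max_passes: int = 3) -> str:
    draft = generate(question)
    for _ in range(max_passes - 1):
        if looks_consistent(draft):
            break
        # Nudge: feed the flawed draft back with a correction instruction.
        draft = generate(f"Fix the inconsistencies in: {draft}")
    return draft

print(answer_with_nudges("What year was the transistor invented?"))
```

The point is only that the check-and-retry loop becomes free from the user's perspective once single-call latency is small enough.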
My attempt at trying one of their OOTB prompts in the demo (https://chat.inceptionlabs.ai) resulted in: "The server is currently overloaded. Please try again in a moment."
And a pop-up error of: "The string did not match the expected pattern."
That happened three times, then the interface stopped working.
I was hoping to see how this stacks up against the Taalas demo, which has worked well and been consistently fast every time I've hit it this past week.
There's a potentially amazing use case here around parsing PDFs to markdown. It seems like a task with insane volume requirements, low budget, and the kind of thing that doesn't benefit much from autoregression. Would be very curious if your team has explored this.
I tried Mercury 1 in Zed for inline completions and it was significantly slower than Cursor's autocomplete. That's a big reason why I switched back to Cursor (free) + Claude Code.
It seems like the chat demo is really suffering from everything going into a queue: you can't actually tell that it's fast at all, and the latency is not good.
Assuming that's what's causing this, they should show some kind of feedback when a request actually makes it out of the queue.
Nice, I'm excited to try this for my voice agent; at worst it could power the human-facing agent for latency reduction.
Genuine question: what kinds of workloads benefit most from this speed? In my coding use, I still hit limitations even with stronger models, so I'm interested in where a much faster model changes the outcome rather than just reducing latency.
This is unbelievably fast
I can see some promise in diffusion LLMs, but getting them comparable to the frontier is going to require a ton of work, and these closed-source offerings probably won't do much to invigorate the field toward breakthroughs. It's too bad they're following OpenAI's path of closed models with no published details, as far as I can tell.
I believe Jimmy Chat is still faster by an order of magnitude…
this looks awesome!!
I'm a little underwhelmed by anything diffusion at the moment -- they haven't really delivered.
Please pre-render your website on the server. Client-side JS means that my agent cannot read the press release, which reduces the chance I'll read it myself. Also, day-one OpenRouter availability increases the chance that someone will try it.
It could be interesting to have a metric of intelligence per second,
i.e. intelligence per token multiplied by tokens per second.
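As a toy calculation -- the quality scores and throughputs below are made-up placeholders, not real benchmark numbers:

```python
# Toy "intelligence per second" metric: quality-per-token proxy times
# tokens per second. All numbers are invented for illustration.

models = {
    # name: (quality_score 0-100, tokens_per_second)
    "frontier-slow": (80, 50),
    "fast-tier":     (60, 1000),
}

for name, (quality, tps) in models.items():
    print(f"{name}: {quality * tps} intelligence/sec (arbitrary units)")
```

On these invented numbers the fast-tier model wins by more than 10x despite the lower per-token quality, which is the intuition behind preferring a "good enough" fast model for iteration-bound work.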
My current feeling is that if Sonnet 4.6 were 5x faster than Opus 4.6, I'd primarily be using Sonnet 4.6. That wasn't true for me with prior model generations: back then, the Sonnet-class models didn't feel good enough compared to the Opus-class models. And it might shift again when I'm doing things that feel more intelligence-bottlenecked.
But fast responses have an advantage of their own: faster iteration. It's kind of like how I used to like OpenAI Deep Research, but switched to o3-thinking with web search enabled once that came out, because it gave 80% of the thoroughness in 20% of the time, which tended to be better overall.