Co-founder / Chief Scientist at Inception here. If helpful, I’m happy to answer technical questions about Mercury 2 or diffusion LMs more broadly.
The iteration speed advantage is real but context-specific. For agentic workloads where you're running loops over structured data -- say, validating outputs or exploring a dataset across many small calls -- the latency difference between a 50 tok/s model and a 1000+ tok/s one compounds fast. What would take 10 minutes wall-clock becomes under a minute, which changes how you prototype.
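A back-of-the-envelope sketch of that compounding (all figures here are illustrative assumptions, not benchmarks):

```python
# Rough wall-clock estimate for an agent loop making many small
# sequential model calls. Token counts, throughputs, and the per-call
# overhead are made-up illustrative numbers.

def loop_seconds(n_calls: int, tokens_per_call: int, tok_per_s: float,
                 overhead_s: float = 0.2) -> float:
    """Total wall-clock: generation time plus fixed per-call overhead."""
    return n_calls * (tokens_per_call / tok_per_s + overhead_s)

slow = loop_seconds(100, 300, 50)     # ~50 tok/s model
fast = loop_seconds(100, 300, 1000)   # ~1000+ tok/s model

print(f"slow: {slow / 60:.1f} min, fast: {fast / 60:.1f} min")
```

With these assumptions the same 100-call loop drops from roughly ten minutes to under one, which is the regime where prototyping habits actually change.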
The open question for me is whether the quality ceiling is high enough for cases where the bottleneck is actually reasoning, not iteration speed. volodia's framing of it as a "fast agent" model (comparable tier to Haiku 4.5) is honest -- for the tasks that fit that tier, the 5x speed advantage is genuinely interesting.
Does that mean if it were embedded on a Taalas chip, it could generate ~50,000+ tokens per second?
I'm not sold on diffusion models.
Other labs like Google have them too, but they've simply trailed the Pareto frontier for the vast majority of use cases.
Here's more detail on how price/performance stacks up
What excites me most about these new 4figure/second token models is that you can essentially do multi-shot prompting (+ nudging) and the user doesn't even feel it, potentially fixing some of the weird hallucinatory/non-deterministic behavior we sometimes end up with.
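One way to picture that pattern -- `generate` and `looks_consistent` below are hypothetical stand-ins for a real model call and a cheap self-check, not anyone's actual API:

```python
# Sketch of multi-shot prompting with a "nudge" pass. At ~1000 tok/s,
# a few sequential calls still finish in well under a second, so the
# user only ever sees the final answer.
# `generate` and `looks_consistent` are hypothetical placeholders.

def generate(prompt: str) -> str:
    # Stand-in for a real (fast) model call.
    return "DRAFT: " + prompt

def looks_consistent(answer: str) -> bool:
    # Stand-in for a cheap self-check (format check, citation check, etc.).
    return answer.startswith("DRAFT:")

def answer_with_nudges(question: str, max_passes: int = 3) -> str:
    draft = generate(question)
    for _ in range(max_passes - 1):
        if looks_consistent(draft):
            break
        # Nudge: feed the flawed draft back with a correction instruction.
        draft = generate(f"Fix the inconsistencies in: {draft}")
    return draft

print(answer_with_nudges("What year was the transistor invented?"))
```

The point is only that the check-and-retry loop becomes free from the user's perspective once single-call latency is small enough.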
My attempt at trying one of their OOTB prompts in the demo (https://chat.inceptionlabs.ai) resulted in: "The server is currently overloaded. Please try again in a moment."
And a pop-up error of: "The string did not match the expected pattern."
That happened three times, then the interface stopped working.
I was hoping to see how this stacks up against the Taalas demo, which has worked well and been consistently fast every time I've hit it this past week.
There's a potentially amazing use case here around parsing PDFs to markdown. It seems like a task with insane volume requirements, low budget, and the kind of thing that doesn't benefit much from autoregression. Would be very curious if your team has explored this.
I tried Mercury 1 in Zed for inline completions and it was significantly slower than Cursor's autocomplete. That's a big reason why I switched back to Cursor (free) + Claude Code.
It seems like the chat demo is really suffering from everything going into a queue: you can't actually tell that it's fast at all, and the latency is not good.
Assuming that's what's causing this, they should show some kind of feedback when a request actually makes it out of the queue.
Nice, I'm excited to try this for my voice agent; at worst it could power the human-facing agent for latency reduction.
Genuine question: what kinds of workloads benefit most from this speed? In my coding use, I still hit limitations even with stronger models, so I'm interested in where a much faster model changes the outcome rather than just reducing latency.
This is unbelievably fast
I can see some promise in diffusion LLMs, but getting them comparable to the frontier is going to require a ton of work, and these closed-source offerings probably won't do much to invigorate the field toward breakthroughs. It's too bad they're following OpenAI's path of closed models with no published details, as far as I can tell.
I believe Jimmy Chat is still faster by an order of magnitude…
this looks awesome!!
I'm a little underwhelmed by anything diffusion at the moment -- they haven't really delivered.
Please pre-render your website on the server. Client-side JS means that my agent cannot read the press release, which reduces the chance I'll read it myself. Also, day-one OpenRouter availability increases the chance that someone will try it.
It could be interesting to have a metric of intelligence per second,
i.e. intelligence per token multiplied by tokens per second.
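As a toy calculation -- the quality scores and throughputs below are made-up placeholders, not real benchmark numbers:

```python
# Toy "intelligence per second" metric: quality-per-token proxy times
# tokens per second. All numbers are invented for illustration.

models = {
    # name: (quality_score 0-100, tokens_per_second)
    "frontier-slow": (80, 50),
    "fast-tier":     (60, 1000),
}

for name, (quality, tps) in models.items():
    print(f"{name}: {quality * tps} intelligence/sec (arbitrary units)")
```

On these invented numbers the fast-tier model wins by more than 10x despite the lower per-token quality, which is the intuition behind preferring a "good enough" fast model for iteration-bound work.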
My current feeling is that if Sonnet 4.6 were 5x faster than Opus 4.6, I'd primarily be using Sonnet 4.6. That wasn't true for me with prior model generations: back then, the Sonnet-class models didn't feel good enough compared to the Opus-class models. And it might shift again when I'm doing things that feel more intelligence-bottlenecked.
But fast responses have an advantage of their own: faster iteration. It's kind of like how I used to like OpenAI Deep Research, but switched to o3-thinking with web search enabled once that came out, because it gave 80% of the thoroughness in 20% of the time, which tended to be better overall.