Voxtral Transcribe 2

268 points • by meetpateltech • today at 3:08 PM • 73 comments

Comments

This demo is really impressive: https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtim...

Don't be confused if it says "no microphone"; the moment you click the record button it will request browser permission and then start working.

I spoke fast and dropped in some jargon and it got it all right - I said this and it transcribed it exactly right, WebAssembly spelling included:

> Can you tell me about RSS and Atom and the role of CSP headers in browser security, especially if you're using WebAssembly?

BrunoJo • today at 6:27 PM

If you are looking for an easy transcription API you may want to check out https://lemonfox.ai/. It's powered by Whisper but we are planning to support more models.

gwerbret • today at 6:19 PM

I really wish those offering speech-to-text models provided transcription benchmarks specific to particular fields of endeavor. I imagine performance would vary wildly when using jargon peculiar to software development, medicine, physics, and law, as compared to everyday speech. Considering that "enterprise" use is often specialized or sub-specialized, it seems like they're leaving money on Dragon's table by not catering to any of those needs.

dmix • today at 4:07 PM

> At approximately 4% word error rate on FLEURS and $0.003/min

Amazon's transcription service is $0.024 per minute, a pretty big difference: https://aws.amazon.com/transcribe/pricing/
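
To make the gap concrete, here is a minimal sketch that compares the two per-minute prices quoted above. The 100-hour monthly workload is a hypothetical figure chosen for illustration, and the flat-rate `cost` helper ignores any volume tiers or free allowances either provider may offer.

```python
# Per-minute prices as quoted in the comment above (USD).
VOXTRAL_PER_MIN = 0.003
AWS_TRANSCRIBE_PER_MIN = 0.024


def cost(price_per_min: float, minutes: float) -> float:
    """Cost in USD of transcribing `minutes` of audio at a flat per-minute price."""
    return price_per_min * minutes


# How many times more expensive the higher quoted price is.
ratio = AWS_TRANSCRIBE_PER_MIN / VOXTRAL_PER_MIN

# Hypothetical workload: 100 hours of audio per month (an assumption).
minutes = 100 * 60

print(f"price ratio: {ratio:.1f}x")
print(f"Voxtral: ${cost(VOXTRAL_PER_MIN, minutes):.2f}")
print(f"Amazon Transcribe: ${cost(AWS_TRANSCRIBE_PER_MIN, minutes):.2f}")
```

At the quoted prices the ratio works out to roughly 8x, whatever workload is assumed.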

janalsncm • today at 5:41 PM

I noticed that this model is multilingual and understands 14 languages. For many use cases, we probably only need a single language, and the extra 13 are simply adding extra latency. I believe there will be a trend in the coming years of trimming the fat off of these jack-of-all-trades models.

https://aclanthology.org/2025.findings-acl.87/

pietz • today at 4:47 PM

Do we know if this is better than Nvidia Parakeet V3? That has been my go-to model locally and it's hard to imagine there's something even better.

observationist • today at 3:53 PM

Native diarization, this looks exciting. edit: or not, no diarization in real-time.

https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-26...

~9GB model.

mdrzn • today at 4:03 PM

There's no comparison to Whisper Large v3 or other Whisper models.

Is it better? Worse? Why do they only compare to GPT-4o mini Transcribe?

satvikpendem • today at 4:39 PM

Looks like this model doesn't do realtime diarization. What model should I use if I want that? So far I've only seen paid models do diarization well. I heard about Nvidia NeMo but haven't tried it and don't even know where to try it out.

aavci • today at 4:40 PM

What are the cheapest device specs that this could realistically run on?

jszymborski • today at 6:14 PM

I'm guessing I won't be able to finetune this until they come out with a HF transformers model, right?

serf • today at 3:53 PM

things I hate:

"Click me to try now!" banners that lead to a warning screen that says "Oh, only paying members, whoops!"

So, you don't mean 'try this out', you mean 'buy this product'.

Let's not act like it's a free sampler.

I can't comment on the model: I'm not giving them money.

siddbudd • today at 5:48 PM

Wired advertises this as "Ultra-Fast Translation"[^1]. A bit weird coming from a tech magazine. I hope it's just a "typo".

[^1]: https://www.wired.com/story/mistral-voxtral-real-time-ai-tra...

yewenjie • today at 5:55 PM

One week ago I was on the hunt for an open source model that can do diarization, and I literally had to give up because I could not find any easy-to-use setup.

antirez • today at 4:15 PM

Italian represents, I believe, the most phonetically advanced human language. It strikes the right compromise among information density, understandability, and the ability to speak much faster to compensate for the redundancy. It's as if it had error correction built in. Note that it not only has the lower error rate but is also underrepresented in most datasets.

XCSme • today at 5:43 PM

Is it just me, or is an error rate of 3% really high?

If you transcribe a minute of conversation, you'll have like 5 words transcribed wrongly. In an hour-long podcast, that is 300 wrongly transcribed words.
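
The estimate above can be reproduced with a short sketch. Word error rate counts substitutions, deletions, and insertions relative to the number of reference words, so the expected error count is roughly the rate times the number of words spoken. The speaking rate of 150 words per minute is an assumption and is not stated in the comment; the helper name `expected_word_errors` is likewise illustrative.

```python
def expected_word_errors(wer: float, words_per_minute: float, minutes: float) -> float:
    """Approximate number of word errors: error rate times words spoken."""
    return wer * words_per_minute * minutes


WER = 0.03  # the 3% error rate discussed above
WPM = 150   # assumed conversational speaking rate (not from the comment)

per_minute = expected_word_errors(WER, WPM, 1)
per_hour = expected_word_errors(WER, WPM, 60)

# Speaking rate implied by the comment's "5 words per minute" figure.
implied_wpm = 5 / WER

print(f"errors per minute: {per_minute:.1f}")
print(f"errors per hour:   {per_hour:.0f}")
print(f"speaking rate implied by 5 errors/min: {implied_wpm:.0f} wpm")
```

Under the 150 wpm assumption this gives about 4.5 errors per minute and about 270 per hour; the comment's rounder figures of 5 and 300 correspond to a somewhat faster speaker, around 167 wpm.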

Archelaos • today at 4:12 PM

As a rule of thumb for software that I use regularly, it is very useful to consider the costs over a 10-year period in order to compare it with software that I purchase with a lifetime license to install at home. So that means $1,798.80 for the Pro version.

What estimates do others use?
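
As a rough sketch of this rule of thumb: dividing the quoted 10-year total of 1,798.80 by 120 months gives the monthly price it implies, which the comment itself does not state. The `lifetime_price` value below is purely hypothetical and only illustrates a break-even comparison against a one-time purchase.

```python
TEN_YEAR_TOTAL = 1798.80  # 10-year cost quoted in the comment above (USD)
YEARS = 10


def subscription_total(monthly_price: float, years: float) -> float:
    """Total subscription cost over `years` at a flat monthly price."""
    return monthly_price * 12 * years


def break_even_years(lifetime_price: float, monthly_price: float) -> float:
    """Years after which a one-time purchase becomes cheaper than subscribing."""
    return lifetime_price / (monthly_price * 12)


# Monthly price implied by the quoted 10-year total (derived, not quoted).
implied_monthly = TEN_YEAR_TOTAL / (YEARS * 12)

# Hypothetical one-time license price, for illustration only.
lifetime_price = 300.00

print(f"implied monthly price: ${implied_monthly:.2f}")
print(f"10-year subscription:  ${subscription_total(implied_monthly, YEARS):.2f}")
print(f"break-even vs. ${lifetime_price:.0f} license: "
      f"{break_even_years(lifetime_price, implied_monthly):.1f} years")
```

The comparison ignores price changes, discounting, and paid upgrades, all of which would shift the break-even point.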

derac • today at 6:02 PM

Any chance Voxtral Mini Transcribe 2 will ever be an open model?

ewuhic • today at 5:52 PM

Can it translate in real time?

boringg • today at 4:52 PM

Pseudo-related -- am I the only one uncomfortable using my voice with AI, out of concern that once it is in the training data it is forever reproducible? As a non-public person it seems like a risk vector (albeit a small one).

dumpstate • today at 5:31 PM

I'm on voxtral-mini-latest and that's why I started seeing 500s today lol

varispeed • today at 4:06 PM

[flagged]
