I really wish those offering speech-to-text models provided transcription benchmarks specific to particular fields of endeavor. I imagine performance would vary wildly when using jargon peculiar to software development, medicine, physics, and law, as compared to everyday speech. Considering that "enterprise" use is often specialized or sub-specialized, it seems like they're leaving money on Dragon's table by not catering to any of those needs.
Try it out! I read various jargon-heavy papers aloud at high speed, and the transcription is stunning.
https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtim...