There's no comparison to Whisper Large v3 or other Whisper models.
Is it better? Worse? Why do they only compare to GPT-4o mini Transcribe?
GPT-4o mini Transcribe is better and actually real-time. Whisper is trained to encode the entire audio (or at least 30-second chunks) and then decode it.
WER is slightly misleading, but Whisper Large v3's WER is classically around 10%, I think, and around 12% for Turbo.
What makes it particularly misleading is that models which transcribe to lowercase and then apply inverse text normalization to restore structure and grammar make a very different class of mistakes than Whisper, which decodes directly to final-form text, including punctuation, quotes, and tone.
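To make that concrete, here's a minimal WER sketch (plain word-level Levenshtein distance, with made-up example strings, not any model's actual output). It shows how a lowercase/ITN-style hypothesis and a final-form hypothesis rack up very different raw scores against the same reference unless you normalize both sides before comparing:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    r, h = ref.split(), hyp.split()
    # Standard Levenshtein DP table over word sequences.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)

ref = "It costs $5, he said."
hyp_final_form = "It cost $5 he said."            # Whisper-style final-form output
hyp_lowercase = "it costs five dollars he said"   # lowercase + spelled-out numbers

print(wer(ref, hyp_final_form))
print(wer(ref, hyp_lowercase))
```

Without a shared text normalizer, the lowercase hypothesis gets charged for casing, punctuation, and number formatting even where the words are right, so headline WER numbers from the two styles of system aren't directly comparable.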
Nonetheless, they're claiming an error rate so much lower than Whisper's that it's almost not in the same bucket.