> I've yet to find a solution that handles all these correctly, let alone having high quality transcriptions.
Wait really? I honestly would have thought this was a solved problem by now, especially high quality transcriptions bit, just out of curiosity, is the problem that the quality isn't high enough?
There are still a few unsolved problems that require tuning for specific applications. Applications that own the video call have a much easier time, they have access to each individual audio stream. Applications like this, however, have to deal with overlapping voices from a single stream. If it's trying to attribute each utterance to an individual, separating the voices is tough, or can lead to confusing transcripts. There are many little problems like this which make it a tough problem in real world usage. Domain specific terms, or proper nouns is another source of inaccuracy.