I've been working on the flip side of this with ASR models, but the problem space is the same: conversational/real-world data is needed. Whisper often misheard words I actually said and hallucinated constantly whenever I spoke technical jargon. The solution was to fine-tune Whisper on my own data. The hardest part imo was getting the actual data, which in turn got me to build listenr (https://github.com/rebreda/listenr). It's an always-on, VAD-based audio dataset builder. Could it be used for building conversational/real-world voice datasets for TTS models too?
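For anyone curious how VAD-based capture works, the core idea can be sketched in a few lines. This is a toy energy-threshold version just to illustrate the gating, not listenr's actual implementation (which can use a proper VAD model):

```python
def segment_speech(samples, frame_len=160, threshold=0.02):
    """Return (start, end) sample indices of contiguous speech runs.

    Toy VAD: a frame counts as speech if its mean power exceeds
    `threshold`. Real builders would use a trained VAD instead.
    """
    segments, start = [], None
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / frame_len  # mean power
        if energy >= threshold:
            if start is None:
                start = i                 # speech begins
        elif start is not None:
            segments.append((start, i))   # speech run ended
            start = None
    if start is not None:                 # speech ran to end of buffer
        segments.append((start, len(samples)))
    return segments

# Toy signal: silence, a loud burst, silence.
audio = [0.0] * 320 + [0.5] * 320 + [0.0] * 320
print(segment_speech(audio))  # -> [(320, 640)]
```

An always-on recorder just runs this gate over the mic stream and writes each kept segment out as a clip, so you only store (and later transcribe) actual speech.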
After getting it working, I was motivated to build out the full fine-tuning pipeline. I wrote a little post about it all: https://quickthoughts.ca/posts/listenr-asr-training-data-pro...