This is great. I've tried automated podcast editing tools before and, in my experience, they cut too aggressively. Now that you've gotten the alignment snapping working cleanly for 'um' and 'ah', what are you planning next? Are you thinking of expanding the tool?
This approach seems kind of backwards to me. Why try to detect everything except the thing you're trying to remove? You could instead sample a few uhs and ums and treat them as noise to be silenced (with a sharp crossfade to the noise floor that doesn't interrupt speech flow), or fine-tune a model to detect them specifically for full automation.
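For what it's worth, the "silence it with a short fade" half of that idea is easy to sketch. This is only an illustration, not anything from the tool under discussion: it assumes mono audio held as a list of float samples, assumes the filler's sample range is already known, and uses plain attenuation as a stand-in for crossfading into real room tone. The name `duck_span` and its parameters are made up for the example.

```python
def duck_span(samples, start, end, floor_gain=0.05, fade=32):
    """Return a copy of samples with [start, end) attenuated to floor_gain.

    Linear ramps of `fade` samples at each edge avoid an audible click.
    Assumption: floor_gain approximates the recording's noise floor.
    """
    out = list(samples)
    start = max(0, start)
    end = min(len(out), end)
    n = end - start
    if n <= 0:
        return out
    # Never let the two ramps overlap on very short spans.
    fade = max(0, min(fade, n // 2))
    for i in range(start, end):
        pos = i - start
        if pos < fade:
            # Ramp down from full gain towards the floor.
            gain = 1.0 - (1.0 - floor_gain) * (pos + 1) / (fade + 1)
        elif pos >= n - fade:
            # Ramp back up towards full gain.
            remaining = n - pos
            gain = 1.0 - (1.0 - floor_gain) * remaining / (fade + 1)
        else:
            gain = floor_gain
        out[i] = samples[i] * gain
    return out
```

Note that this keeps the total duration unchanged, so it leaves a quiet gap where the filler was rather than tightening the speech.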
Not to promote something, but Wispr Flow does that for me automatically if I enable a setting for it.
While it's a commercial product with a subscription, I spent a long time on the free tier not even hitting their limits until I started using it so extensively that I wanted to pay for it.
And I've used Whisper in the past, mostly for tinkering. I tried it for a couple of use cases but haven't touched the base project in a while. But I do regularly use Faster-Whisper-XXL, an open source project based on Whisper, for subtitle generation.
Though, for subtitle generation, I decided to support the project and mainly use the non-public build of Faster-Whisper-XXL Pro built for donors to the open source project.
The extra features smooth out the subtitle editing process substantially. Add "--roformer_overlap 0.125 --roformer_vram 16 --best_of 15 --ff_vocal_extract mb-roformer --vad_method pyannote_v3" to the CLI parameters (and sometimes --realign) and you have much less cleanup to do in SubtitleEdit or Tero Subtitler afterwards.
This is fascinating! I'm going to try this on a certain clip from Jurassic Park.
I would love to see support for videos and removal of custom filler words (I say 'basically' and 'like' a lot and have so far failed to improve myself on this).
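Custom filler words could in principle be matched against word-level timestamps. A rough sketch, assuming the transcript arrives as a list of dicts with `word`, `start`, and `end` keys (that shape and the name `find_fillers` are assumptions for illustration, not the tool's actual format):

```python
import string

def find_fillers(words, fillers):
    """Return (start, end) times for every word matching a custom filler list.

    `words` is assumed to be a list of {"word", "start", "end"} dicts.
    Matching ignores case and surrounding punctuation.
    """
    targets = {f.lower() for f in fillers}
    cuts = []
    for w in words:
        token = w["word"].strip().strip(string.punctuation).lower()
        if token in targets:
            cuts.append((w["start"], w["end"]))
    return cuts
```

The catch is that 'like' and 'basically' are often real words ("I like it"), so token matching alone will over-cut. Results like these are probably better treated as candidates for review than as automatic cuts.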
What an awesome tool and idea. I’d be keen to see if it can integrate with video editing tools.
Ideally it would slice the video in the timeline without actually removing anything, so you can scrub through your video and try with and without each disfluency (thank you - awesome word) & decide case by case which to keep!
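One way to picture that non-destructive workflow: instead of deleting anything, split the timeline into tagged segments and toggle each disfluency on or off. This is a hypothetical sketch (the function names and the segment dict shape are invented here, and it says nothing about how any particular video editor imports cuts):

```python
def build_timeline(duration, cuts):
    """Split [0, duration] into ordered segments tagged 'speech' or 'disfluency'.

    Nothing is removed; disfluencies start disabled and can be re-enabled.
    `cuts` is a list of (start, end) times in seconds.
    """
    segments = []
    cursor = 0.0
    for start, end in sorted(cuts):
        start = max(start, cursor)   # clip overlapping cuts
        end = min(end, duration)
        if end <= start:
            continue
        if start > cursor:
            segments.append({"start": cursor, "end": start,
                             "kind": "speech", "enabled": True})
        segments.append({"start": start, "end": end,
                         "kind": "disfluency", "enabled": False})
        cursor = end
    if cursor < duration:
        segments.append({"start": cursor, "end": duration,
                         "kind": "speech", "enabled": True})
    return segments

def kept_duration(segments):
    """Total length of the segments that are currently enabled."""
    return sum(s["end"] - s["start"] for s in segments if s["enabled"])
```

Flipping `enabled` on a single disfluency segment is then the "try with and without" decision, made case by case.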
I think it is harder to remove those from your own speech. I have been working on that for a few months now, and I still slip back into it when I am in a hurry or stressed.
There is an aah counter in Toastmasters!! This is the software that helps!!
Really cool stuff and definitely going to try it; I’m also finding it wild that Google put effort into adding ums and erms into their text to speech model a while back. AI puts it in, AI helps take it out.
Disfluencies are not necessarily "filler". They can convey mood or hesitation. Cutting them can change the meaning.
A trivial example is "umm... well... (sigh) okay" versus just "okay". Not okay!
This post is mostly about how surprisingly hard it is to cut filler words out of speech cleanly. Apparently, stripping ums isn't a find and replace type thing, because Whisper's timestamps are off by up to a few hundred ms and cutting on them chops syllables or leaves stutters. So, I built a tool, erm, that starts from Whisper's guess, finds where each word actually starts and stops in the audio, and snaps the cuts to silence so there's no click, with ffmpeg doing the splicing.
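To make the "snap the cuts to silence" idea concrete, here is a minimal sketch. It is not the tool's actual code and skips the word-alignment step entirely. It assumes mono audio as a list of float samples and a rough (start, end) guess in seconds, and simply moves each boundary to the quietest short frame nearby. All function names and the default frame and window sizes are assumptions for illustration.

```python
import math

def frame_rms(samples, frame_len):
    """RMS energy of each non-overlapping frame of frame_len samples."""
    energies = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return energies

def snap_to_quietest(t, energies, frame_sec, window_sec):
    """Move time t to the centre of the quietest frame within +/- window_sec."""
    if not energies:
        return t
    lo = max(0, int((t - window_sec) / frame_sec))
    hi = min(len(energies) - 1, int((t + window_sec) / frame_sec))
    if lo > hi:
        return t
    best = min(range(lo, hi + 1), key=lambda i: energies[i])
    return (best + 0.5) * frame_sec

def snap_cut(start, end, samples, sample_rate, frame_sec=0.01, window_sec=0.2):
    """Snap a rough (start, end) cut, in seconds, to nearby low-energy frames."""
    frame_len = max(1, int(sample_rate * frame_sec))
    energies = frame_rms(samples, frame_len)
    actual_frame_sec = frame_len / sample_rate
    new_start = snap_to_quietest(start, energies, actual_frame_sec, window_sec)
    new_end = snap_to_quietest(end, energies, actual_frame_sec, window_sec)
    return new_start, max(new_start, new_end)
```

The search window matters: too small and a timestamp that is a few hundred ms off never reaches the real gap, too large and the cut can jump to a pause inside a neighbouring word.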
It’s a nice engineering approach, but I’m interested in the motivation. Ums and ahs are distracting in a transcript, where you can naturally pause to take in information; in speech, however, they can serve as a focusing point to indicate that the next part is important. See https://medium.com/better-humans/dont-worry-about-saying-um-... for example. The obsessive zeal that orgs like Toastmasters have about eliminating them is weird.
Disfluencies aren’t necessarily bad even if the word starts with “dis”!