Hacker News

tl2do · yesterday at 11:23 PM

This matches my experience. In Kaggle audio competitions, I've seen many competitors struggle with basics like proper PCM filtering: anti-aliasing before downsampling, handling spectral leakage, etc.
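
A minimal sketch of that first failure mode (Python/scipy; the sample rates and test tone are arbitrary illustrations, nothing competition-specific):

    # Downsampling 48 kHz -> 16 kHz. Naive decimation keeps every 3rd
    # sample with no filtering, so a 10 kHz tone (above the new 8 kHz
    # Nyquist) folds back into the band as a 6 kHz alias.
    # resample_poly low-pass filters first, which suppresses it.
    import numpy as np
    from scipy.signal import resample_poly

    sr_in, sr_out = 48_000, 16_000
    t = np.arange(sr_in) / sr_in
    x = np.sin(2 * np.pi * 10_000 * t)       # 1 s test tone at 10 kHz

    naive = x[::sr_in // sr_out]             # aliased: spurious 6 kHz tone
    clean = resample_poly(x, sr_out, sr_in)  # anti-aliased: tone attenuated

Take an FFT of both and the naive version shows a peak at 6 kHz that the filtered version doesn't have.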

Audio really is a blue ocean compared to text/image ML. The barriers aren't primarily compute or data - they're knowledge. You can't scale your way out of bad preprocessing or codec choices.

When 4 researchers can build Moshi from scratch in 6 months while big labs consider voice "solved," it shows we're still in a phase where domain expertise matters more than scale. There's an enormous opportunity here for teams who understand both ML and signal processing fundamentals.


Replies

nubg · today at 2:21 AM

AI bot comment

derf_ · today at 12:42 AM

Also, while the author complains that there is not a lot of high quality data around [0], you do not need a lot of data to train small models. Depending on the problem you are trying to solve, you can do a lot with single-digit gigabytes of audio data. See, e.g., https://jmvalin.ca/demo/rnnoise/
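
To make "small" concrete, something in the spirit of RNNoise (a hedged sketch, not Valin's actual architecture or code): a couple of GRU layers predicting per-band gains from band energies lands around 10^5 parameters, which single-digit gigabytes of audio can train without much trouble.

    # Illustrative tiny band-gain denoiser; the layer sizes and the
    # 22-band count are assumptions for this sketch.
    import torch
    import torch.nn as nn

    class TinyDenoiser(nn.Module):
        def __init__(self, n_bands: int = 22, hidden: int = 96):
            super().__init__()
            self.gru = nn.GRU(n_bands, hidden, num_layers=2, batch_first=True)
            self.head = nn.Sequential(nn.Linear(hidden, n_bands), nn.Sigmoid())

        def forward(self, log_band_energy):      # (batch, frames, n_bands)
            h, _ = self.gru(log_band_energy)
            return self.head(h)                  # per-band gains in [0, 1]

    model = TinyDenoiser()
    print(sum(p.numel() for p in model.parameters()))  # roughly 9e4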

[0] Which I do agree with, particularly if you need it to be higher quality or labeled in a particular way: the Fisher database mentioned is narrowband and 8-bit mu-law quantized, and while there are timestamps, they are not accurate enough for millisecond-level active speech determination. It is also fewer than 6,000 conversations totaling less than 1,000 hours (×2 speakers, but each is silent over half the time, which can also throw a wrench into some standard algorithms, like volume normalization). And it is English-only.
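
On that last parenthetical: with whole-file RMS, a channel that is silent over half the time reads as quiet, so the computed gain over-boosts the speech. A rough fix is to normalize against RMS over active frames only (a sketch with a crude energy-threshold VAD; the frame size, threshold, and target level are arbitrary choices):

    import numpy as np

    def active_speech_rms(x, sr, frame_ms=20, thresh_db=-40.0):
        # RMS per frame; keep only frames above a crude energy gate.
        frame = int(sr * frame_ms / 1000)
        n = len(x) // frame
        rms = np.sqrt(np.mean(x[:n * frame].reshape(n, frame) ** 2, axis=1) + 1e-12)
        active = rms > 10 ** (thresh_db / 20)
        return np.sqrt(np.mean(rms[active] ** 2)) if active.any() else rms.max()

    def normalize(x, sr, target_db=-23.0):
        # Gain from speech-only level, so half-silent files aren't over-boosted.
        return x * (10 ** (target_db / 20) / active_speech_rms(x, sr))

A real pipeline would use a proper VAD and a loudness measure like ITU-R BS.1770 (which gates out silence for the same reason), but the idea is the same.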
