I was hoping to make a piano practice assistant for my kids that would take sheet music in MusicXML format, listen to the microphone stream, and check for things they frequently miss, like rests, dynamics, and consistent tempo.
Surprisingly, the blocker has been identifying notes from the microphone input. I assumed that would be a long-solved problem: just do an FFT and find the peaks of the spectrogram? But apparently that doesn't work well when there are harmonics, reverb and such, and you have to use AI models (Google and Spotify have some) to do it. And so far it still seems to fail if more than three notes are played simultaneously.
Now I'm baffled how song identification can work, if even identifying notes is so unreliable! Maybe I'm doing something wrong.
Note detection works OK if you ignore the octave. Otherwise, you need to know the relative strength of the overtones, which is instrument-dependent. Some years ago I built a piano training app with an FFT plus a Kalman filter.
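To make "ignoring the octave" concrete, here is a minimal sketch of folding a detected frequency down to one of the 12 pitch classes. It assumes equal temperament with A4 = 440 Hz, and the function name and note spelling are just illustrative choices, not anything from the app mentioned above. The point is that a peak one octave too high or too low (the usual harmonic mix-up) still lands on the same note name.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class(freq_hz, a4_hz=440.0):
    """Map a frequency to its note name, discarding the octave.

    Assumes equal temperament tuned to a4_hz (an assumption, not from the thread).
    """
    # MIDI note number: 69 is A4, 12 semitones per octave.
    midi = round(69 + 12 * math.log2(freq_hz / a4_hz))
    # Folding modulo 12 throws the octave away.
    return NOTE_NAMES[midi % 12]
```

With this, `pitch_class(220.0)`, `pitch_class(440.0)` and `pitch_class(880.0)` all give the same note, so confusing a fundamental with its octave harmonic no longer matters.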
I always wanted to do a keyboard/tablet combo (maybe they make these, I don't know).
The idea is a fully weighted hammer-action keyboard with nothing else, such as the Arturia KeyLab 88 MkII, with tiny LED lights added above each key, plus a tablet computer that runs a tutor. It shows the notes but also a Guitar Hero-like display of the upcoming notes, the LEDs light up to show where to press, and it gives corrections for timing, heaviness of press, etc.
Here's an algorithm I cooked up for my (never completed) master's thesis:
It's based on the assumption that the most common frequency difference in all pairs of spectrum peaks is the base frequency of the sound.
- For the FFT, use a Gaussian window, because then your peaks look like Gaussians. The logarithm of a Gaussian is a parabola, so you only need three samples around the peak to calculate the exact frequency.
- Gather all the peaks along with their amplitudes. Pair all combinations.
- Create a histogram of frequency differences in those pairs, weighted by the product of the amplitudes of the peaks.
When you recognise a frequency you can attenuate it via comb filter and run the algorithm again to find another one.
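The steps above could be sketched roughly like this in Python with NumPy. This is only my reading of the description, not the thesis code: the window width (`sigma_frac`), the peak threshold (`min_rel_amp`), the histogram bin size (`bin_hz`) and the lower cut-off (`min_hz`) are all made-up parameters, and the comb-filter step for finding further notes is left out.

```python
import numpy as np

def gaussian_window(n, sigma_frac=0.12):
    # Gaussian window centred on the frame; sigma_frac is an assumed width.
    t = np.arange(n) - (n - 1) / 2
    return np.exp(-0.5 * (t / (sigma_frac * n)) ** 2)

def find_peaks(signal, sample_rate, min_rel_amp=0.05):
    """Return a list of (frequency_hz, amplitude) for spectrum peaks."""
    n = len(signal)
    mag = np.abs(np.fft.rfft(signal * gaussian_window(n)))
    log_mag = np.log(mag + 1e-12)  # small offset avoids log(0)
    threshold = min_rel_amp * mag.max()
    peaks = []
    for k in range(1, len(mag) - 1):
        if mag[k] > threshold and mag[k] > mag[k - 1] and mag[k] >= mag[k + 1]:
            # Fit a parabola through three log-magnitude samples;
            # its vertex gives the fractional bin offset of the peak.
            a, b, c = log_mag[k - 1], log_mag[k], log_mag[k + 1]
            offset = 0.5 * (a - c) / (a - 2 * b + c)
            peaks.append(((k + offset) * sample_rate / n, mag[k]))
    return peaks

def estimate_fundamental(peaks, bin_hz=1.0, min_hz=50.0):
    """Most heavily weighted frequency difference among all peak pairs."""
    hist = {}
    for i in range(len(peaks)):
        for j in range(i + 1, len(peaks)):
            diff = abs(peaks[i][0] - peaks[j][0])
            if diff < min_hz:
                continue  # ignore differences below the assumed lowest note
            key = round(diff / bin_hz)
            # Weight each pair by the product of its two peak amplitudes.
            hist[key] = hist.get(key, 0.0) + peaks[i][1] * peaks[j][1]
    if not hist:
        return None  # fewer than two usable peaks
    return max(hist, key=hist.get) * bin_hz
```

To try it, synthesise a tone with a few harmonics, pass it through `find_peaks`, and hand the result to `estimate_fundamental`. Note that the histogram needs at least two peaks, so a pure sine wave gives `None` here.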