Here's an algorithm I cooked up for my (never completed) master's thesis:
It's based on the assumption that the most common frequency difference in all pairs of spectrum peaks is the base frequency of the sound.
-For the FFT use the Gaussian window because then your peaks look like Gaussians - the logarithm of a Gaussian is a parabola, so you only need three samples around the peak to calculate the exact frequency.
-Gather all the peaks along with their amplitudes. Pair all combinations.
-Create a histogram of frequency differences in those pairs, weighted by the product of the amplitudes of the peaks.
When you recognise a frequency you can attenuate it via comb filter and run the algorithm again to find another one.