The mel spectrogram is the first stage of a speech recognition pipeline...
But perhaps you'd get better results if more of an ML speech/audio recognition pipeline were included?
E.g. the pipeline could separate drum beats from piano notes and present them differently in the visualization?
An autoencoder trained to minimize a perceptual reconstruction loss would probably have the most 'interesting' information at its bottleneck, so that's the layer I'd feed into my LED strip.
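To make the idea concrete, here's a toy numpy sketch of what I mean: a one-hidden-layer autoencoder whose bottleneck activations get scaled to LED channel values. It trains on random data with plain MSE (a real perceptual loss and real mel frames would replace both of those; all names and sizes here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy input: 256 "mel frames" of 40 bins each (random stand-ins for real audio features).
X = rng.normal(size=(256, 40))

n_in, n_bottleneck = 40, 8  # 8 bottleneck units -> 8 LED channels, say
W1 = rng.normal(scale=0.1, size=(n_in, n_bottleneck))
b1 = np.zeros(n_bottleneck)
W2 = rng.normal(scale=0.1, size=(n_bottleneck, n_in))
b2 = np.zeros(n_in)

def forward(x):
    h = np.tanh(x @ W1 + b1)       # bottleneck activations
    return h, h @ W2 + b2          # reconstruction

lr = 0.01
losses = []
for step in range(500):
    h, out = forward(X)
    err = out - X                  # MSE gradient; a perceptual loss would weight this differently
    losses.append((err ** 2).mean())
    gW2 = h.T @ err / len(X)
    gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1 - h ** 2)   # backprop through tanh
    gW1 = X.T @ dh / len(X)
    gb1 = dh.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

# The bottleneck vector for one frame is what you'd map onto the strip.
h, _ = forward(X[:1])
leds = np.clip((h[0] + 1) / 2 * 255, 0, 255).astype(int)  # tanh range -> 0..255
```

The point is just that the bottleneck is a small, dense summary of the frame, which is exactly the shape of thing an LED strip wants.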
I was playing around with this recently, but the problem I ran into is that most AI analysis techniques like stem separation aren't built to run in real time.
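For contrast, the classic DSP features easily fit the budget. A rough numpy sketch of a streaming mel-frame loop (chunk size, mel count, and the simulated stream are all arbitrary choices, not from any particular library):

```python
import numpy as np
import time

SR = 44100
CHUNK = 1024          # ~23 ms of audio per hop at 44.1 kHz
N_MELS = 16

def mel_filterbank(n_fft, n_mels, sr):
    # Triangular filters spaced on the mel scale (standard HTK-style formula).
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft // 2 + 1) * mel_to_hz(mels) / (sr / 2)).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l: fb[i, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c: fb[i, c:r] = (r - np.arange(c, r)) / (r - c)
    return fb

fb = mel_filterbank(CHUNK, N_MELS, SR)
window = np.hanning(CHUNK)

def process_chunk(chunk):
    spectrum = np.abs(np.fft.rfft(chunk * window))
    return fb @ spectrum   # one mel frame, ready to map onto LEDs

# Simulate a stream: per-chunk compute time must stay under the chunk duration.
chunk = np.random.default_rng(1).normal(size=CHUNK)
t0 = time.perf_counter()
frames = [process_chunk(chunk) for _ in range(100)]
per_chunk_ms = (time.perf_counter() - t0) / 100 * 1000
budget_ms = CHUNK / SR * 1000   # ~23 ms available per chunk
```

A windowed FFT plus a filterbank matmul costs microseconds per chunk, so there's headroom to spare; a source-separation network, by contrast, typically wants seconds of context and far more compute, which is where the real-time plan falls apart.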