Neat.
When I worked in a linguistics lab as an undergraduate long ago, we looked at spectrograms to identify sounds (specifically places of articulation) as much as we listened to recordings.
So it makes some sense to build a model on them rather than some other representation of the sound.