Ctrl-F in that document for 'hashing'. That step reduces the audio information to a sparse collection of key points, one for each of four frequency ranges per time segment. I would assume that everything up to that step is done on the phone and only the key points are sent to the server.
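For what it's worth, here is a rough sketch of what that kind of reduction could look like: for each time segment, keep only the strongest frequency bin in each of four ranges. The function names, the frame size of 64 samples, the band boundaries, and the naive DFT are all my own assumptions for illustration, not taken from the document, and the later step of combining key points into hashes is not shown.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2-1 (slow, but fine for a tiny demo)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2)
    ]

# Hypothetical band edges (in DFT bin indices), chosen only for this example.
BANDS = ((1, 8), (8, 16), (16, 24), (24, 32))

def key_points(samples, frame_size=64, bands=BANDS):
    """Return one tuple per time segment: the strongest bin in each band."""
    points = []
    # Non-overlapping segments; a trailing partial segment is ignored.
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        mags = magnitude_spectrum(samples[start:start + frame_size])
        points.append(
            tuple(max(range(lo, hi), key=lambda k: mags[k]) for lo, hi in bands)
        )
    return points

# Two segments of a test signal with one sinusoid in each of the four bands.
tone = [
    sum(math.sin(2 * math.pi * k * t / 64) for k in (3, 10, 20, 30))
    for t in range(128)
]
print(key_points(tone))
```

Each 64-sample segment shrinks to four small integers, which is the sort of size reduction that would make it plausible to send only the key points over the network.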
I stand corrected. I had only speed-read the abstract.