Hacker News

dijksterhuis, today at 1:19 PM

i did something in my phd developing an attack against mozilla deepspeech.

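deepspeech used the CTC algorithm [0], which adds a “blank” token that the model can emit on any audio/speech feature frame where it has no new alphabet character to output. decoding merges consecutive repeats of the same token and then drops the blanks, so a blank between two identical characters is what keeps a genuine double letter.

so, writing the blank as "=", "h==e=l===l===o====" maps to "hello"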
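a minimal sketch of that collapse step in python, in case it helps to see it run. it assumes "=" stands in for the blank (as in the string above) and works on plain strings rather than real model outputs; `ctc_collapse` is just a name picked for illustration, not anything from deepspeech.

```python
from itertools import groupby

BLANK = "="  # assumption: "=" stands in for the CTC blank token


def ctc_collapse(frames: str, blank: str = BLANK) -> str:
    """Collapse a per-frame token string the way CTC decoding does:
    merge consecutive repeats first, then drop the blanks."""
    merged = (token for token, _ in groupby(frames))
    return "".join(token for token in merged if token != blank)


print(ctc_collapse("h==e=l===l===o===="))  # hello
print(ctc_collapse("hheel=lloo"))          # hello (the blank keeps the double l)
print(ctc_collapse("hheelllloo"))          # helo  (no blank, so the l's merge)
```

the last two calls show why the blank is needed at all: without one between the two l's, the repeats merge into a single character.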

the model becomes super biased towards predicting that blank token. one speech feature covers something like 0.1 seconds of audio or less (can’t remember offhand), so every character gets stretched over a lot of frames and most of those frames end up as blanks. offhand i seem to remember the predicted token distribution over something like 1000 audio files was about 50% blank token, with the other 50% spread across the rest of the alphabet.
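the counting itself is only a few lines. here is a rough sketch, assuming you already have per-frame predicted tokens as strings with "=" as the blank. the three alignment strings are made up for illustration (they are not the ~1000 files mentioned above), and `token_shares` is a hypothetical helper name.

```python
from collections import Counter

BLANK = "="  # assumption: "=" stands in for the CTC blank token


def token_shares(alignments):
    """Return each token's share of all predicted frames, most common first."""
    counts = Counter()
    for frames in alignments:
        counts.update(frames)  # counts every per-frame token in the string
    total = sum(counts.values())
    if total == 0:
        return {}
    return {token: n / total for token, n in counts.most_common()}


# made-up per-frame predictions, purely for illustration
alignments = ["h==e=l===l===o====", "==c=a==t==", "d=o==g===="]

shares = token_shares(alignments)
for token, share in shares.items():
    # crude text bar chart: one "#" per 2.5% of frames
    print(f"{token!r} {share:6.1%} {'#' * round(share * 40)}")
```

on real model outputs you would swap the toy strings for the argmax token per frame and plot the shares however you like.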

as a result, you can get significantly smaller perturbations when generating adversarial examples, by a factor of something like 2-4. all you need to do is prioritise blank tokens in your target output.
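to make "prioritise blank tokens" a bit more concrete, here is one possible way to build a blank-heavy per-frame target. this is an illustrative assumption, not the actual attack: each character of the target transcript gets exactly one frame and every other frame is a blank. how that target then feeds into the loss and the perturbation optimiser is not covered here. `blank_heavy_alignment` and `ctc_collapse` are hypothetical helper names.

```python
from itertools import groupby

BLANK = "="  # assumption: "=" stands in for the CTC blank token


def ctc_collapse(frames: str, blank: str = BLANK) -> str:
    """Merge consecutive repeats, then drop blanks."""
    return "".join(t for t, _ in groupby(frames) if t != blank)


def blank_heavy_alignment(transcript: str, n_frames: int, blank: str = BLANK) -> str:
    """Spread the transcript over n_frames with one frame per character
    and blanks everywhere else."""
    if blank in transcript:
        raise ValueError("transcript must not contain the blank symbol")
    if n_frames < 2 * len(transcript):
        # at least 2 frames per character guarantees a blank between neighbours,
        # which keeps double letters from merging
        raise ValueError("need at least 2 frames per character")
    frames = [blank] * n_frames
    if transcript:
        slot = n_frames / len(transcript)
        for i, char in enumerate(transcript):
            frames[int(i * slot)] = char
    return "".join(frames)


target = blank_heavy_alignment("hello", 18)
print(target)
print(ctc_collapse(target))  # round-trips back to the transcript
```

the 2-frames-per-character check is a simplification so that neighbouring characters always have a blank between them.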

i spent 2 years trying to find a super clever attack. turns out all i needed to do was make one simple graph counting characters. xD

[0]: https://en.wikipedia.org/wiki/Connectionist_temporal_classif...