Does this transfer to Whisper / CLAP-type audio models or is it ASR-decoder specific? Whisper would be interesting given how widely it's used in prod.
Audio adv. examples didn't use to show the same degree of transferability (generate for one model, works against another) that image adv. examples achieved. Likely because of the RNN architecture, or audio is just harder :shrug:
the article says
> This required full access to the model, restricting the researchers to open models with publicly available weights. They found, however, that attacks developed for open models transferred to commercial models from Microsoft and Mistral that share the same underlying architecture.
so it depends on what architecture whisper is using (i don't think it's an LLM? or it wasn't last time i checked about 4 years ago lol)
edit -- replaced last section, missed this bit in the article
Yeah, there have been several papers with attacks on Whisper:
- Inject adversarial noise to make it transcribe what you want (https://arxiv.org/abs/2210.17316)
- Stop it from transcribing (https://arxiv.org/abs/2405.06134)
- Adversarial prompt injection to make it translate instead of transcribe (https://arxiv.org/abs/2407.04482v2)