Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models

Vyas Raina, Rao Ma, Charles McGhee, Kate Knill, Mark Gales
2024-07-17
Abstract: Recent developments in large speech foundation models like Whisper have led to their widespread use in many automatic speech recognition (ASR) applications. These systems incorporate 'special tokens' in their vocabulary, such as `<|endoftext|>`, to guide their language generation process. However, we demonstrate that these tokens can be exploited by adversarial attacks to manipulate the model's behavior. We propose a simple yet effective method to learn a universal acoustic realization of Whisper's `<|endoftext|>` token, which, when prepended to any speech signal, encourages the model to ignore the speech and only transcribe the special token, effectively 'muting' the model. Our experiments demonstrate that the same, universal 0.64-second adversarial audio segment can successfully mute a target Whisper ASR model for over 97% of speech samples. Moreover, we find that this universal adversarial audio segment often transfers to new datasets and tasks. Overall, this work demonstrates the vulnerability of Whisper models to 'muting' adversarial attacks, where such attacks can pose both risks and potential benefits in real-world settings: for example, the attack can be used to bypass speech moderation systems, or conversely it can be used to protect private speech data.
Computation and Language, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper asks whether an adversarial attack can disable an automatic speech recognition (ASR) system. Specifically, it explores how to learn a single short, universal adversarial audio segment that "mutes" large speech foundation models such as Whisper, so that the model produces no transcription for any speech signal the segment is attached to. The paper demonstrates the effectiveness and broad applicability of this attack and discusses its potential risks and benefits.

### Specific Problem Description:

1. **Goal of the Adversarial Attack**: The paper proposes a simple and effective method to learn a universal, 0.64-second adversarial audio segment. When this segment is prepended to any speech signal, it leads the model to ignore the actual speech and transcribe only a special token (`<|endoftext|>`), which ends generation immediately and so "mutes" the model.
2. **Experimental Validation**: The experiments show that the same 0.64-second adversarial segment mutes the target Whisper ASR model on over 97% of speech samples. The segment also often transfers to new datasets and tasks.
3. **Potential Impact**: The paper discusses the risks and benefits of this muting attack in real-world settings. For example, an attacker could use it to bypass speech moderation systems and publish harmful content. Conversely, the same technique could protect privacy by preventing sensitive speech data from being automatically transcribed.

### Main Contributions:

1. **Short Adversarial Segment**: A 0.64-second adversarial audio segment that can be prepended to any speech signal to mute the model.
2. **Universality**: The same segment is used for every speech signal; it is not re-optimized per input.
3. **Modern ASR Systems**: The method applies to modern, powerful ASR systems such as the Whisper family of models.
4. **Specific Target**: The attack has a targeted goal, muting the Whisper model, which the summary describes as practically significant and not previously considered in research.
5. **Transferability**: The segment often remains effective on different datasets and can also transfer across speech processing tasks (such as speech transcription and translation).

### Conclusion:

The paper demonstrates that large speech foundation models like Whisper are vulnerable to muting adversarial attacks, and it highlights both the risks and the potential benefits of such attacks. These findings are relevant to improving the security of ASR systems.
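The attack setup summarized above (one fixed 0.64-second segment placed in front of every utterance, with success measured over many samples) can be illustrated with a minimal sketch. It is not the authors' code: the function names `prepend_segment` and `mute_rate` are hypothetical, audio is represented as plain Python sequences of float samples to avoid dependencies, the 16 kHz sample rate is assumed because that is the rate Whisper operates on, and counting an empty transcript as "muted" is an assumed success criterion. Learning the segment itself is not shown.

```python
from typing import List, Sequence

SAMPLE_RATE_HZ = 16_000   # assumption: Whisper's 16 kHz input rate
SEGMENT_SECONDS = 0.64    # segment length reported in the abstract
SEGMENT_SAMPLES = round(SAMPLE_RATE_HZ * SEGMENT_SECONDS)  # 10240 samples


def prepend_segment(segment: Sequence[float], speech: Sequence[float]) -> List[float]:
    """Return the attacked waveform: the universal segment followed by the speech.

    Hypothetical helper; the same `segment` is reused for every utterance,
    which is what makes the attack universal.
    """
    if len(segment) == 0:
        raise ValueError("adversarial segment must not be empty")
    return list(segment) + list(speech)


def mute_rate(transcripts: Sequence[str]) -> float:
    """Fraction of transcripts that are empty after stripping whitespace.

    Assumed criterion: an utterance counts as muted when the ASR output
    contains no text at all.
    """
    if len(transcripts) == 0:
        raise ValueError("need at least one transcript")
    muted = sum(1 for text in transcripts if text.strip() == "")
    return muted / len(transcripts)


if __name__ == "__main__":
    # Placeholder segment (silence); a real attack would use learned samples.
    segment = [0.0] * SEGMENT_SAMPLES
    speech = [0.1] * SAMPLE_RATE_HZ  # one second of dummy audio
    attacked = prepend_segment(segment, speech)
    print(len(attacked) / SAMPLE_RATE_HZ, "seconds of attacked audio")

    # Transcripts would come from running an ASR model on attacked audio.
    print(mute_rate(["", "", "hello world", " "]))
```

In practice, the transcripts passed to `mute_rate` would be produced by running the target ASR model on each attacked waveform; that step is left out here because it depends on the specific model interface.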