Abstract: Recent developments in large speech foundation models like Whisper have led to their widespread use in many automatic speech recognition (ASR) applications. These systems incorporate `special tokens' in their vocabulary, such as $\texttt{<|endoftext|>}$, to guide their language generation process. However, we demonstrate that these tokens can be exploited by adversarial attacks to manipulate the model's behavior. We propose a simple yet effective method to learn a universal acoustic realization of Whisper's $\texttt{<|endoftext|>}$ token, which, when prepended to any speech signal, encourages the model to ignore the speech and only transcribe the special token, effectively `muting' the model. Our experiments demonstrate that the same, universal 0.64-second adversarial audio segment can successfully mute a target Whisper ASR model for over 97\% of speech samples. Moreover, we find that this universal adversarial audio segment often transfers to new datasets and tasks. Overall, this work demonstrates the vulnerability of Whisper models to `muting' adversarial attacks. Such attacks pose both risks and potential benefits in real-world settings: for example, the attack can be used to bypass speech moderation systems, or conversely it can be used to protect private speech data.

What problem does this paper attempt to address?

The problem this paper addresses is how to use an adversarial attack to disable an automatic speech recognition (ASR) system. Specifically, it explores how to learn a universal, short acoustic adversarial segment that "mutes" large speech foundation models like Whisper (i.e., causes them to produce no transcription) for arbitrary speech signals. The paper demonstrates the effectiveness and transferability of this attack and discusses its potential risks and benefits.

### Specific Problem Description:

1. **Goal of the Adversarial Attack**: The paper proposes a simple and effective method to learn a universal, 0.64-second acoustic adversarial segment. When this segment is prepended to a speech signal, it encourages the model to ignore the actual speech and transcribe only a special token (`<|endoftext|>`), effectively muting the model.
2. **Experimental Validation**: The experiments show that the same 0.64-second adversarial audio segment mutes the target Whisper ASR model on over 97% of speech samples. The segment also often transfers to new datasets and tasks.
3. **Potential Impact**: The paper discusses the potential risks and benefits of this muting attack in real-world settings. For example, attackers could exploit the vulnerability to bypass speech moderation systems and publish harmful content. Conversely, the same technique could be used to protect privacy by preventing sensitive speech data from being automatically transcribed.

### Main Contributions:

1. **Short Adversarial Segment**: A 0.64-second adversarial audio segment that can be prepended to a speech signal to mute the model.
2. **Universality**: The same segment is universal: it is learned once and applied unchanged to different speech signals.
3. **Modern ASR Systems**: The method is applicable to modern, powerful ASR systems, such as the Whisper family of models.
4. **Specific Target**: The attack specifically aims to mute the Whisper model, a targeted goal of practical significance that, according to the paper, has not been considered in previous research.
5. **Transferability**: The adversarial segment often transfers across datasets and across speech processing tasks (such as transcription and translation).

### Conclusion:

The paper demonstrates the vulnerability of large speech foundation models like Whisper to muting adversarial attacks and highlights both the risks and the potential benefits of such attacks. These findings are relevant to improving the security of ASR systems.
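
As a rough illustration of the setup summarized above, the following Python sketch shows what "prepending a universal 0.64-second segment" and "muting over 97% of samples" mean operationally. It is a minimal sketch under stated assumptions, not the paper's implementation: the 16 kHz sample rate is Whisper's standard input rate, while the function names `prepend_segment` and `mute_success_rate`, the random placeholder segment, and the hand-written transcripts are hypothetical and used only for illustration. No ASR model is run and no segment is learned here.

```python
import random

SAMPLE_RATE = 16_000     # Hz; Whisper takes 16 kHz mono audio as input
SEGMENT_SECONDS = 0.64   # segment duration reported in the abstract
SEGMENT_SAMPLES = round(SAMPLE_RATE * SEGMENT_SECONDS)  # 10240 samples


def prepend_segment(segment, speech):
    """Return a new waveform with the fixed segment placed before the speech."""
    return list(segment) + list(speech)


def mute_success_rate(transcripts):
    """Fraction of transcripts that are empty after stripping whitespace."""
    if not transcripts:
        return 0.0
    muted = sum(1 for text in transcripts if text.strip() == "")
    return muted / len(transcripts)


# Placeholder data (assumption): a random low-amplitude segment stands in for
# a learned adversarial segment, and the speech is random noise.
rng = random.Random(0)
segment = [rng.uniform(-0.02, 0.02) for _ in range(SEGMENT_SAMPLES)]
speech = [rng.uniform(-1.0, 1.0) for _ in range(3 * SAMPLE_RATE)]
attacked = prepend_segment(segment, speech)

# Hand-written transcripts (assumption): in a real evaluation these would be
# the ASR outputs for each attacked utterance.
example_transcripts = ["", "", "hello world", " "]
rate = mute_success_rate(example_transcripts)
print(len(attacked), rate)
```

In a real evaluation, the same `segment` would be reused unchanged for every utterance in the test set, which is what makes the attack universal, and the success rate would be computed from the model's actual outputs.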

Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models
