Abstract: Speech enabled foundation models, either in the form of flexible speech recognition based systems or audio-prompted large language models (LLMs), are becoming increasingly popular. One of the interesting aspects of these models is their ability to perform tasks other than automatic speech recognition (ASR) using an appropriate prompt. For example, the OpenAI Whisper model can perform both speech transcription and speech translation. With the development of audio-prompted LLMs there is the potential for even greater control options. In this work we demonstrate that with this greater flexibility the systems can be susceptible to model-control adversarial attacks. Without any access to the model prompt it is possible to modify the behaviour of the system by appropriately changing the audio input. To illustrate this risk, we demonstrate that it is possible to prepend a short universal adversarial acoustic segment to any input speech signal to override the prompt setting of an ASR foundation model. Specifically, we successfully use a universal adversarial acoustic segment to control Whisper to always perform speech translation, despite being set to perform speech transcription. Overall, this work demonstrates a new form of adversarial attack on multi-tasking speech enabled foundation models that needs to be considered prior to the deployment of this form of model.

What problem does this paper attempt to address?

The paper addresses model-control adversarial attacks on multi-task automatic speech recognition (ASR) models. Specifically, it demonstrates that, even without access to the model prompt, an attacker can make a multi-task ASR model perform a different task than intended by prepending a short universal adversarial acoustic segment to the input audio signal. For example, the paper shows how this method can force OpenAI's Whisper model, set to perform speech transcription, to perform speech translation instead.

### Main Research Content

1. **Background and Motivation**:
   - Multi-task ASR models (such as Whisper) can perform several speech processing tasks, such as speech transcription and speech translation.
   - This flexibility introduces a new security vulnerability, the model-control adversarial attack, in which an attacker changes the model's task setting by modifying the input audio.
2. **Threat Model**:
   - The attacker cannot directly modify the model's internal structure or prompts, but can achieve their goal by modifying the input audio.
   - The attack is conducted in the acoustic space, and the adversarial segment must be simple to apply so that it works with real-time speech processing.
3. **Attack Method**:
   - A short universal adversarial acoustic segment is prepended to the input audio, forcing the model to perform a task other than the one it was set to perform.
   - The adversarial segment is optimized using gradient descent to maximize the probability of generating the target task without arousing suspicion.
4. **Experimental Results**:
   - Experiments were conducted on multiple language pairs, including French-English, German-English, Russian-English, and Korean-English, to verify the effectiveness and generalization ability of the attack.
   - The results show that as the intensity of the adversarial segment increases, the effect of the model-control attack gradually approaches the performance upper limit of the translation mode.
   - The success rate and translation quality of the attack follow a binary distribution: for a given input the attack either succeeds completely or fails completely, with no intermediate state.

### Conclusion

The paper reveals the vulnerability of multi-task speech foundation models to model-control adversarial attacks and demonstrates that prepending a short universal adversarial acoustic segment can change the model's task setting. The success of such attacks is clearly binary, which emphasizes the need for enhanced security measures when deploying flexible ASR systems. Future research should focus on developing robust defense mechanisms against model-control adversarial attacks.
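The attack method summarized above (one shared segment, prepended to every input, trained by gradient descent under an amplitude limit) can be illustrated with a deliberately tiny sketch. This is not the paper's implementation and does not use Whisper: `translate_logit` is a hypothetical toy scorer standing in for the model's preference for the translate task, the "speech" is random noise, and `FREQ`, `BIAS`, `eps`, `lr`, and the segment length are all assumed values chosen only to make the example run quickly.

```python
import math
import random

FREQ = 0.3    # hypothetical: frequency the toy model responds to
BIAS = -4.0   # hypothetical: negative bias, so clean audio is "transcribed"

def translate_logit(audio):
    """Toy stand-in for the model's score for the 'translate' task."""
    return BIAS + sum(math.cos(FREQ * t) * s for t, s in enumerate(audio))

def prob_translate(audio):
    return 1.0 / (1.0 + math.exp(-translate_logit(audio)))

def make_speech(rng):
    """Random noise standing in for an utterance of 100 to 200 samples."""
    return [rng.gauss(0.0, 0.05) for _ in range(rng.randint(100, 200))]

def learn_segment(train, seg_len=40, eps=0.5, lr=0.05, steps=200):
    """Projected gradient descent on the mean of -log p(translate | segment + speech)."""
    segment = [0.0] * seg_len
    for _ in range(steps):
        # For this toy model d(logit)/d(segment[t]) = cos(FREQ * t), so
        # d(loss)/d(segment[t]) = -(1 - p) * cos(FREQ * t), averaged over utterances.
        miss = sum(1.0 - prob_translate(segment + x) for x in train) / len(train)
        for t in range(seg_len):
            step = lr * miss * math.cos(FREQ * t)
            # Clip to [-eps, eps]: the amplitude ("intensity") limit on the segment.
            segment[t] = max(-eps, min(eps, segment[t] + step))
    return segment

def success_rate(segment, utterances):
    """Fraction of utterances where the toy model prefers 'translate'."""
    hits = sum(prob_translate(segment + x) > 0.5 for x in utterances)
    return hits / len(utterances)

rng = random.Random(0)
train = [make_speech(rng) for _ in range(20)]
held_out = [make_speech(rng) for _ in range(50)]   # unseen utterances

weak = learn_segment(train, eps=0.02)
strong = learn_segment(train, eps=0.5)
print("no segment:", success_rate([], held_out))
print("eps=0.02  :", success_rate(weak, held_out))
print("eps=0.5   :", success_rate(strong, held_out))
```

The same segment is reused for every held-out utterance, which is what makes it universal in this sketch, and raising `eps` is the toy analogue of increasing the intensity of the adversarial segment. A real attack would replace `translate_logit` with gradients obtained from the actual speech model.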

Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Speech Foundation Models

Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models

Echo: Reverberation-based Fast Black-Box Adversarial Attacks on Intelligent Audio Systems

The Silent Manipulator: A Practical and Inaudible Backdoor Attack against Speech Recognition Systems

UltraBD: Backdoor Attack against Automatic Speaker Verification Systems via Adversarial Ultrasound

Defending Adversarial Attacks on Cloud-aided Automatic Speech Recognition Systems

Hidden in Plain Sound: Environmental Backdoor Poisoning Attacks on Whisper, and Mitigations

There is more than one kind of robustness: Fooling Whisper with adversarial examples

Universal Adversarial Perturbations for Speech Recognition Systems

Inaudible Adversarial Perturbation: Manipulating the Recognition of User Speech in Real Time

Query-Efficient Adversarial Attack with Low Perturbation Against End-to-End Speech Recognition Systems

TransAudio: Towards the Transferable Adversarial Audio Attack via Learning Contextualized Perturbations

Imperceptible, Robust, and Targeted Adversarial Examples for Automatic Speech Recognition

Watch Your Speed: Injecting Malicious Voice Commands via Time-Scale Modification

Push the Limit of Adversarial Example Attack on Speaker Recognition in Physical Domain

Adversarial Examples for Automatic Speech Recognition: Attacks and Countermeasures

Adversarial Music: Real World Audio Adversary Against Wake-word Detection System

IUAC: Inaudible Universal Adversarial Attacks Against Smart Speakers

Voiceprint Mimicry Attack Towards Speaker Verification System in Smart Home

UniAP: Protecting Speech Privacy with Non-Targeted Universal Adversarial Perturbations

Model Access Control Based on Hidden Adversarial Examples for Automatic Speech Recognition