Abstract: Voice interfaces are becoming widely accepted as input methods for a diverse set of devices. This development is driven by rapid improvements in automatic speech recognition (ASR), which now performs on par with human listening in many tasks. These improvements are based on an ongoing evolution of deep neural networks (DNNs) as the computational core of ASR. However, recent research shows that DNNs are vulnerable to adversarial perturbations, which allow attackers to force the transcription into a malicious output. In this paper, we introduce a new type of adversarial example based on psychoacoustic hiding. Our attack exploits the characteristics of DNN-based ASR systems: we extend the original analysis procedure with an additional backpropagation step. We use this backpropagation to learn the degrees of freedom for the adversarial perturbation of the input signal, i.e., we apply a psychoacoustic model and manipulate the acoustic signal below the thresholds of human perception. To further minimize the perceptibility of the perturbations, we use forced alignment to find the best-fitting temporal alignment between the original audio sample and the malicious target transcription. These extensions allow us to embed a malicious voice command into an arbitrary audio input; the command is then transcribed by the ASR system, while the audio signal remains barely distinguishable from the original. In an experimental evaluation, we attack the state-of-the-art speech recognition system Kaldi and determine the best-performing parameter and analysis setup for different types of input. Our results show that we are successful in up to 98% of cases, with a computational effort of less than two minutes for a ten-second audio file. Based on user studies, we found that none of our target transcriptions were audible to human listeners, who still understood the original speech content with unchanged accuracy.
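The idea of keeping a perturbation below the thresholds of human perception can be illustrated with a minimal sketch. It is not the procedure used in the paper, which applies the psychoacoustic model during backpropagation. Here it is assumed that a per-frequency-bin threshold in dB (the hypothetical `threshold_db`) has already been computed for one audio frame, and the perturbation is attenuated afterwards wherever it exceeds that threshold. The function name and parameters are illustrative assumptions.

```python
import numpy as np

def clip_to_threshold(delta, threshold_db, eps=1e-12):
    """Attenuate one perturbation frame so no frequency bin exceeds threshold_db.

    delta        : 1-D real array, the perturbation for one frame (hypothetical input)
    threshold_db : 1-D array with len(delta) // 2 + 1 entries, one level per rfft bin
                   (assumed to come from some psychoacoustic model, not computed here)
    """
    spectrum = np.fft.rfft(delta)
    # Level of each bin in dB; eps avoids log(0) for silent bins.
    level_db = 20.0 * np.log10(np.abs(spectrum) + eps)
    # How far each bin lies above its threshold (0 where it is already below).
    excess_db = np.maximum(level_db - threshold_db, 0.0)
    # Scale the offending bins down by exactly their excess; phase is unchanged.
    gain = 10.0 ** (-excess_db / 20.0)
    return np.fft.irfft(spectrum * gain, n=len(delta))

# Usage with made-up data: a random perturbation and a flat 0 dB threshold.
rng = np.random.default_rng(0)
frame = rng.standard_normal(512)
flat_threshold = np.zeros(512 // 2 + 1)
limited = clip_to_threshold(frame, flat_threshold)
```

Bins that already lie below the threshold pass through unchanged, so the perturbation is reduced only where it would otherwise be audible under the assumed threshold.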

Hidden in Plain Sound: Environmental Backdoor Poisoning Attacks on Whisper, and Mitigations

The Silent Manipulator: A Practical and Inaudible Backdoor Attack against Speech Recognition Systems

Remote Attacks on Speech Recognition Systems Using Sound from Power Supply

Echo: Reverberation-based Fast Black-Box Adversarial Attacks on Intelligent Audio Systems

FenceSitter: Black-box, Content-Agnostic, and Synchronization-Free Enrollment-Phase Attacks on Speaker Recognition Systems

Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models

Towards Stealthy Backdoor Attacks against Speech Recognition via Elements of Sound

VenoMave: Targeted Poisoning Against Speech Recognition

Adversarial Attacks Against Automatic Speech Recognition Systems via Psychoacoustic Hiding

There is more than one kind of robustness: Fooling Whisper with adversarial examples

Query-Efficient Adversarial Attack with Low Perturbation Against End-to-End Speech Recognition Systems

SlothSpeech: Denial-of-service Attack Against Speech Recognition Models

Imperceptible Rhythm Backdoor Attacks: Exploring Rhythm Transformation for Embedding Undetectable Vulnerabilities on Speech Recognition

WaveFuzz: A Clean-Label Poisoning Attack to Protect Your Voice

Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Speech Foundation Models

Adversarial Agents For Attacking Inaudible Voice Activated Devices

Data Poisoning and Backdoor Attacks on Audio Intelligence Systems

Inaudible Adversarial Perturbation: Manipulating the Recognition of User Speech in Real Time

VSVC: Backdoor attack against Keyword Spotting based on Voiceprint Selection and Voice Conversion