Abstract: In recent years, significant progress has been made in deep model-based automatic speech recognition (ASR), leading to its widespread deployment in the real world. At the same time, adversarial attacks against deep ASR systems are highly successful. Various methods have been proposed to defend ASR systems from these attacks. However, existing classification-based methods focus on the design of deep learning models while lacking exploration of domain-specific features. This work leverages filter bank-based features to better capture the characteristics of attacks for improved detection. Furthermore, the paper analyses the potential of using speech and non-speech parts separately in detecting adversarial attacks. Finally, considering adverse environments where ASR systems may be deployed, we study the impact of acoustic noise of various types and signal-to-noise ratios. Extensive experiments show that the inverse filter bank features generally perform better in both clean and noisy environments, that detection is effective using either the speech or the non-speech part, and that acoustic noise can largely degrade the detection performance.
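Noisy conditions like those mentioned in the abstract are commonly simulated by scaling a noise recording so that the mixture reaches a target signal-to-noise ratio. The following is a minimal sketch of that standard procedure, not the authors' code; the function name `mix_at_snr`, the sine-wave stand-in for speech, and the random noise are illustrative assumptions.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the mixture has the requested SNR in dB."""
    # Tile, then trim, the noise so it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose gain g so that speech_power / (g^2 * noise_power) = 10^(snr_db / 10).
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

rng = np.random.default_rng(0)
# Hypothetical inputs: a 1 s tone standing in for speech, white noise as the interferer.
speech = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise = rng.standard_normal(4000)
noisy = mix_at_snr(speech, noise, snr_db=10.0)
```

Lower `snr_db` values give a louder noise component, which corresponds to the low-SNR conditions where detection is reported to degrade most.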

What problem does this paper attempt to address?

The main problem this paper addresses is improving the detection of adversarial attacks against deep speech recognition systems, particularly in noisy environments. Specifically, the paper focuses on the following aspects:

1. **Utilizing domain-specific features**: The paper proposes filter-bank-based features to better capture the characteristics of adversarial attacks and improve detection. In particular, it studies the performance of inverse filter-bank features in adversarial attack detection.
2. **Analyzing the roles of speech and non-speech parts**: The paper explores the potential of using speech parts and non-speech parts separately for adversarial attack detection, as well as their relative importance.
3. **Studying the influence of noisy environments**: Because speech recognition systems may be deployed in adverse acoustic environments, the paper studies how different noise types and signal-to-noise ratios (SNRs) affect adversarial attack detection.

### Main contributions of the paper:

1. **Systematically studied multiple cepstral features**: These include MFCC (Mel-Frequency Cepstral Coefficients), IMFCC (Inverse MFCC), GFCC (Gammatone-Frequency Cepstral Coefficients), IGFCC (Inverse GFCC) and LFCC (Linear-Frequency Cepstral Coefficients). The performance of these features in adversarial attack detection is compared and analyzed in detail.
2. **Systematically studied the influence of noisy environments on detection performance for the first time**: Experiments measure how adversarial attack detection changes under different noise types and SNR conditions.
3. **Analyzed the roles of speech and non-speech parts in detection**: The research found that the non-speech part provides stronger clues when detecting adversarial attacks, and that combining the speech and non-speech parts during training can further improve detection performance.
4. **Improved the reproducibility of the work**: The authors provide the source code and parameter configurations so that other researchers can reproduce the experimental results, and they make the complete catalogue of experimental results public.

### Experimental design and methods:

- **Data set**: Two data sets are used: a white-box attack data set and a black-box attack data set. The samples are cut into 512-millisecond chunks, and each chunk is further divided into frames from which cepstral features are extracted.
- **Detection model**: A convolutional neural network (CNN) is used as the classifier. Its input is a fixed-size feature matrix of shape (31, 20), and its output is a binary decision (benign or adversarial sample).
- **Feature extraction**: Multiple cepstral features are studied, including MFCC, IMFCC, GFCC, IGFCC and LFCC, and the importance of different frequency regions is explored.
- **Detection in noisy environments**: Multiple noise types (such as restaurant noise and bus noise) and different SNR conditions are considered, and the detection performance of the different features is evaluated under each.

### Experimental results:

- **Performance of cepstral features**: Inverse filter-bank features (IGFCC and IMFCC) perform best in most cases, especially the high-resolution features in the linear-frequency region.
- **Influence of noisy environments**: Noise significantly reduces detection performance, especially under low-SNR conditions. IGFCC shows strong robustness in noisy environments.
- **Roles of speech and non-speech parts**: The non-speech part provides stronger clues when detecting adversarial attacks, and combining the speech and non-speech parts during training can further improve detection performance.

In conclusion, through systematic experiments, this paper proposes effective methods for detecting adversarial attacks and characterizes detection performance across different features and environmental conditions, providing a useful reference for improving the security of speech recognition systems.
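The chunking and cepstral feature extraction described above can be sketched as follows. This is an illustrative approximation, not the authors' implementation. The 16 kHz sampling rate, the 512-sample frames with a 256-sample hop (chosen because they yield 31 frames per 512 ms chunk), the 40 filters, and the helper names are all assumptions. The inverse filter bank is built with a common construction, mirroring a mel filter bank so that the narrow filters sit at high frequencies.

```python
import numpy as np

SR = 16000              # assumed sampling rate (Hz)
CHUNK = 8192            # 512 ms at the assumed 16 kHz
FRAME, HOP = 512, 256   # assumed 32 ms frames, 16 ms hop -> 31 frames per chunk
N_FILT, N_CEP = 40, 20  # assumed filter count; 20 coefficients match the (31, 20) input

def mel_filterbank(n_filt, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_filt + 2))
    bin_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    fb = np.zeros((n_filt, len(bin_freqs)))
    for i in range(n_filt):
        lo, mid, hi = edges_hz[i : i + 3]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def cepstra(chunk, fb, n_cep=N_CEP):
    """Frame a chunk, apply a filter bank, and take the DCT-II of the log energies."""
    n_frames = 1 + (len(chunk) - FRAME) // HOP
    idx = np.arange(FRAME)[None, :] + HOP * np.arange(n_frames)[:, None]
    frames = chunk[idx] * np.hamming(FRAME)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    log_energy = np.log(power @ fb.T + 1e-10)
    n = fb.shape[0]
    # DCT-II basis: cos(pi * k * (2m + 1) / (2n)) for coefficient k and filter m.
    k = np.arange(n_cep)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return log_energy @ basis.T

mel_fb = mel_filterbank(N_FILT, FRAME, SR)
inv_mel_fb = mel_fb[::-1, ::-1]  # mirrored: narrow filters at high frequencies

audio = np.random.default_rng(0).standard_normal(SR * 2)  # 2 s stand-in signal
chunks = [audio[s : s + CHUNK] for s in range(0, len(audio) - CHUNK + 1, CHUNK)]
features = np.stack([cepstra(c, inv_mel_fb) for c in chunks])
print(features.shape)  # (3, 31, 20)
```

Passing `mel_fb` instead of `inv_mel_fb` gives MFCC-style features with the same shape, so the two can be compared with an identical classifier.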

Leveraging Domain Features for Detecting Adversarial Attacks Against Deep Speech Recognition in Noise
