Leveraging Domain Features for Detecting Adversarial Attacks Against Deep Speech Recognition in Noise

Christian Heider Nielsen, Zheng-Hua Tan
DOI: https://doi.org/10.48550/arXiv.2211.01621
2022-11-03
Abstract: In recent years, significant progress has been made in deep model-based automatic speech recognition (ASR), leading to its widespread deployment in the real world. At the same time, adversarial attacks against deep ASR systems are highly successful. Various methods have been proposed to defend ASR systems from these attacks. However, existing classification based methods focus on the design of deep learning models while lacking exploration of domain specific features. This work leverages filter bank-based features to better capture the characteristics of attacks for improved detection. Furthermore, the paper analyses the potentials of using speech and non-speech parts separately in detecting adversarial attacks. In the end, considering adverse environments where ASR systems may be deployed, we study the impact of acoustic noise of various types and signal-to-noise ratios. Extensive experiments show that the inverse filter bank features generally perform better in both clean and noisy environments, the detection is effective using either speech or non-speech part, and the acoustic noise can largely degrade the detection performance.
Audio and Speech Processing, Cryptography and Security, Machine Learning, Sound
What problem does this paper attempt to address?
The paper aims to improve the detection of adversarial attacks against deep speech recognition systems, particularly in noisy environments. Specifically, it focuses on the following aspects:

1. **Utilizing domain-specific features**: The paper proposes filter-bank-based features to better capture the characteristics of adversarial attacks and improve detection. In particular, it studies the performance of inverse filter-bank features.
2. **Analyzing the roles of speech and non-speech parts**: The paper explores the potential of using the speech and non-speech parts separately for adversarial attack detection, as well as their relative importance.
3. **Studying the influence of noisy environments**: Because speech recognition systems may be deployed in adverse acoustic environments, the paper studies how different noise types and signal-to-noise ratios (SNRs) affect detection performance.

### Main contributions of the paper:

1. **Systematically studied multiple cepstral features**: These include MFCC (Mel-Frequency Cepstral Coefficients), IMFCC (Inverse MFCC), GFCC (Gammatone-Frequency Cepstral Coefficients), IGFCC (Inverse GFCC) and LFCC (Linear-Frequency Cepstral Coefficients). Their performance in adversarial attack detection is compared and analyzed in detail.
2. **Systematically studied the influence of noisy environments on detection performance for the first time**: Experiments measure how detection performance changes under different noise types and SNR conditions.
3. **Analyzed the roles of speech and non-speech parts in detection**: The non-speech part provides stronger cues for detecting adversarial attacks, and combining the speech and non-speech parts during training further improves detection performance.
4. **Improved the reproducibility of the work**: The source code and parameter configurations are provided so that other researchers can reproduce the results, and the complete catalogue of experimental results is made public.

### Experimental design and methods:

- **Datasets**: Two datasets are used: a white-box attack dataset and a black-box attack dataset. Samples are cut into 512-millisecond chunks, and each chunk is further divided into frames from which cepstral features are extracted.
- **Detection model**: A convolutional neural network (CNN) serves as the classifier. It takes a fixed-size feature input of shape (31, 20) and outputs a binary decision (benign or adversarial sample).
- **Feature extraction**: Multiple cepstral features are studied, including MFCC, IMFCC, GFCC, IGFCC and LFCC, and the importance of different frequency regions is explored.
- **Detection in noisy environments**: Multiple noise types (such as restaurant noise and bus noise) and different SNR conditions are considered, and the detection performance of each feature is evaluated in noise.

### Experimental results:

- **Performance of cepstral features**: Inverse filter-bank features (IGFCC and IMFCC) perform best in most cases, especially the high-resolution features in the linear-frequency region.
- **Influence of noisy environments**: Noise significantly reduces detection performance, especially under low-SNR conditions. IGFCC shows strong robustness in noisy environments.
- **Roles of speech and non-speech parts**: The non-speech part provides stronger cues for detecting adversarial attacks, and combining the speech and non-speech parts during training further improves detection performance.

In conclusion, through systematic experiments, the paper proposes effective methods for detecting adversarial attacks and examines detection performance under different features and environmental conditions, providing a useful reference for improving the security of speech recognition systems.
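The pipeline summarized above (mixing noise at a target SNR, cutting a 512 ms chunk into frames, and computing cepstra from a mel or inverse-mel filter bank) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The sampling rate (16 kHz), frame length and hop (512 and 256 samples, chosen because they happen to yield 31 frames per chunk), the 40 filters, the Hamming window, and all function names are assumptions made for illustration. The inverse bank is built here by mirroring a mel bank along the frequency axis, which is one common way to define IMFCC-style features.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumed; the summary does not state the sampling rate
FRAME_LEN = 512       # assumed 32 ms frames
HOP_LEN = 256         # assumed 16 ms hop, giving 31 frames per 512 ms chunk
N_FILTERS = 40        # assumed number of filter-bank channels
N_CEPS = 20           # matches the (31, 20) input shape in the summary


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the speech-to-noise power ratio is snr_db."""
    noise = np.resize(noise, speech.shape)  # repeat or trim noise to the speech length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filter_bank(n_filters=N_FILTERS, n_fft=FRAME_LEN, sr=SAMPLE_RATE, inverse=False):
    """Triangular mel filter bank of shape (n_filters, n_fft // 2 + 1).

    With inverse=True the bank is mirrored in frequency, so the narrow
    (high-resolution) filters sit at high frequencies instead of low ones.
    """
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_filters + 2))
    freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    bank = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank[::-1, ::-1].copy() if inverse else bank


def frame_signal(chunk, frame_len=FRAME_LEN, hop=HOP_LEN):
    """Slice a 1-D chunk into overlapping frames of shape (n_frames, frame_len)."""
    n_frames = 1 + (len(chunk) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return chunk[idx]


def cepstral_features(chunk, inverse=False, n_ceps=N_CEPS):
    """Return (n_frames, n_ceps) cepstra: MFCC-like, or IMFCC-like if inverse=True."""
    frames = frame_signal(chunk) * np.hamming(FRAME_LEN)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    log_energy = np.log(power @ mel_filter_bank(inverse=inverse).T + 1e-10)
    # Unnormalized DCT-II over the filter axis, written out to avoid extra dependencies.
    n = log_energy.shape[1]
    k = np.arange(n_ceps)[:, None]
    basis = np.cos(np.pi * k * (np.arange(n) + 0.5) / n)
    return log_energy @ basis.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    chunk = rng.standard_normal(int(0.512 * SAMPLE_RATE))  # stand-in for a speech chunk
    noisy = mix_at_snr(chunk, rng.standard_normal(3000), snr_db=10.0)
    print(cepstral_features(noisy, inverse=True).shape)
```

Under these assumed settings, each 512 ms chunk yields a (31, 20) feature matrix, which matches the classifier input shape given in the summary. Gammatone-based variants (GFCC, IGFCC) would replace the triangular mel filters with a gammatone bank and are not covered by this sketch.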