Abstract: Automatic Speech Recognition (ASR) systems must be robust to the many types of noise present in real-world environments, including environmental noise, room impulse responses, and special effects, as well as attacks by malicious actors (adversarial attacks). Recent works seek to improve accuracy and robustness by developing novel Deep Neural Networks (DNNs) and curating diverse training datasets for them, while using relatively simple acoustic features. While this approach improves robustness to the types of noise present in the training data, it confers limited robustness against unseen noises and negligible robustness to adversarial attacks. In this paper, we revisit the approach of earlier works that developed acoustic features inspired by biological auditory perception for accurate and robust ASR. Specifically, we evaluate the ASR accuracy and robustness of several biologically inspired acoustic features. In addition to several features from prior works, such as gammatone filterbank features (GammSpec), we propose two new acoustic features, the frequency masked spectrogram (FreqMask) and the difference of gammatones spectrogram (DoGSpec), which simulate the neuro-psychological phenomena of frequency masking and lateral suppression. Experiments on diverse models and datasets show that (1) DoGSpec achieves significantly better robustness than the highly popular log mel spectrogram (LogMelSpec) with minimal accuracy degradation, and (2) GammSpec achieves better accuracy and robustness to non-adversarial noises from the Speech Robust Bench benchmark, but it is outperformed by DoGSpec against adversarial attacks.

Revisiting Acoustic Features for Robust ASR
