Towards Decoupling Frontend Enhancement and Backend Recognition in Monaural Robust ASR

Yufeng Yang, Ashutosh Pandey, DeLiang Wang
2024-03-11
Abstract: It has been shown that the intelligibility of noisy speech can be improved by speech enhancement (SE) algorithms. However, monaural SE has not been established as an effective frontend for automatic speech recognition (ASR) in noisy conditions compared to an ASR model trained on noisy speech directly. The divide between SE and ASR impedes the progress of robust ASR systems, especially as SE has made major advances in recent years. This paper focuses on eliminating this divide with two enhancement models: ARN (attentive recurrent network), which operates in the time domain, and CrossNet, which operates in the time-frequency domain. The proposed systems fully decouple frontend enhancement and backend ASR, with the backend trained only on clean speech. Results on the WSJ, CHiME-2, LibriSpeech, and CHiME-4 corpora demonstrate that ARN and CrossNet enhanced speech both translate to improved ASR results in noisy and reverberant environments, and generalize well to real acoustic scenarios. The proposed system outperforms the baselines trained on corrupted speech directly. Furthermore, it cuts the previous best word error rate (WER) on CHiME-2 by $28.4\%$ relatively with a $5.57\%$ WER, and achieves $3.32/4.44\%$ WER on single-channel CHiME-4 simulated/real test data without training on CHiME-4.
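The relative figure in the abstract can be unpacked with a little arithmetic. The sketch below assumes the usual definition of relative reduction, (old - new) / old, and backs out the earlier CHiME-2 WER that the stated numbers imply. The function names are illustrative, and the implied value is derived from the abstract's two numbers rather than quoted from the paper.

```python
def relative_reduction(old_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (old - new) / old."""
    return 100.0 * (old_wer - new_wer) / old_wer


def implied_previous_wer(new_wer: float, reduction_pct: float) -> float:
    """Invert the formula above to recover the earlier WER (in percent)."""
    return new_wer / (1.0 - reduction_pct / 100.0)


# Numbers taken from the abstract: 5.57% WER, 28.4% relative reduction.
previous = implied_previous_wer(5.57, 28.4)
print(round(previous, 2))                            # 7.78
print(round(relative_reduction(previous, 5.57), 1))  # 28.4
```

Under that assumption, a 28.4% relative reduction ending at 5.57% WER corresponds to an earlier best of roughly 7.78% WER, an absolute drop of about 2.2 percentage points.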
Sound, Audio and Speech Processing
### What problem does this paper attempt to solve?

This paper addresses the divide between frontend enhancement and backend recognition in single-channel (monaural) robust automatic speech recognition (ASR). Speech enhancement (SE) algorithms can improve the intelligibility of noisy speech, but monaural SE has not been established as an effective ASR frontend in noisy conditions when compared with an ASR model trained directly on noisy speech. The main reason is that single-channel enhancement introduces distortion, which degrades ASR performance.

To close this gap, the paper improves robustness by fully decoupling frontend enhancement from backend ASR:

1. **Frontend enhancement**: Two models are used for speech enhancement: the attentive recurrent network (ARN), which operates in the time domain, and CrossNet, which operates in the time-frequency domain. Both remove noise and reverberation from the speech signal.
2. **Backend ASR**: The backend ASR model is trained only on clean speech and is never exposed to distortion that the frontend may introduce. The frontend and backend can therefore be optimized independently, which makes the system more flexible.
3. **Experimental verification**: Experiments on the WSJ, CHiME-2, LibriSpeech, and CHiME-4 corpora show that the decoupled systems outperform baselines trained directly on corrupted speech in noisy and reverberant environments, and that they generalize well to unseen datasets.

In this way, the paper improves ASR performance in complex acoustic environments while reducing dependence on specific noise types, which makes the system more robust and adaptable.

### Main contributions

1. **Decoupling frontend and backend**: The paper proposes fully decoupling frontend enhancement from backend ASR, in contrast to the common practice of training ASR models directly on noisy speech.
2. **Performance improvement**: The method is validated on multiple datasets. It reaches a 5.57% word error rate (WER) on CHiME-2, a 28.4% relative reduction over the previous best result, and 3.32%/4.44% WER on single-channel CHiME-4 simulated/real test data without training on CHiME-4.
3. **Generalization ability**: The method generalizes across corpora and to real acoustic scenarios, performing well even on test data from corpora it was not trained on.

This decoupling strategy offers a direction for future research on how to combine speech enhancement and automatic speech recognition.
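The decoupled structure described above can be sketched as two independent callables composed at inference time, scored with a standard word-level edit-distance WER. This is a minimal illustration of the interface only: `toy_enhance` and `toy_recognize` are hypothetical stand-ins, not ARN, CrossNet, or the paper's ASR backend, and nothing here reflects the authors' implementation.

```python
from typing import Callable, List

Waveform = List[float]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumes a non-empty reference transcript.
    """
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def decoupled_recognize(
    noisy: Waveform,
    enhance: Callable[[Waveform], Waveform],
    recognize: Callable[[Waveform], str],
) -> str:
    """Run the frontend, then the backend; neither knows about the other."""
    return recognize(enhance(noisy))


def toy_enhance(noisy: Waveform) -> Waveform:
    # Hypothetical placeholder frontend: clips samples to [-1, 1].
    return [max(-1.0, min(1.0, s)) for s in noisy]


def toy_recognize(wave: Waveform) -> str:
    # Hypothetical placeholder backend: returns a fixed transcript.
    return "turn the lights on"


hypothesis = decoupled_recognize([0.2, 1.7, -0.4], toy_enhance, toy_recognize)
print(word_error_rate("turn the light on", hypothesis))  # 0.25
```

Because the two stages share only a waveform interface, either one can be swapped or retrained without touching the other, which is the flexibility the summary attributes to the decoupled design.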