Investigation of Monaural Front-End Processing for Robust Speech Recognition Without Retraining or Joint-Training

Zhihao Du, Xueliang Zhang, Jiqing Han
DOI: https://doi.org/10.1109/apsipaasc47483.2019.9023011
2018-01-01
Abstract: There are two effective approaches to improving the performance of an automatic speech recognizer with front-end processing under noisy conditions: retraining the acoustic model on the enhanced features, and jointly training the acoustic model with the front-end processing model. In practice, however, automatic speech recognition (ASR) systems are typically hosted on cloud servers while front-end processing models run locally, which makes the retraining and joint-training strategies impractical. In this paper, we investigate whether independent front-end processing can directly improve the performance of a speech recognizer without retraining or joint training. Three commonly used enhancement methods are evaluated in different time-frequency (T-F) domains. Our experiments on CHiME-3 reveal that, with appropriate T-F domains and enhancement methods, front-end processing achieves relative word error rate (WER) reductions of 35.30% and 11.78% for the Gaussian mixture model based (GMM-based) and deep neural network based (DNN-based) recognizers, respectively. For the DNN-based ASR system, we propose using masking-based methods in the log-fbank domain for front-end processing. We find that masking-based methods are, in general, better than spectral-mapping-based methods with respect to WER reduction. In addition, the phase of the noisy speech is useless, and even harmful, for reducing the WER. Regarding generalization, front-end processing can improve a multi-condition-trained ASR system under both matched and unmatched noise conditions.
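The two quantities the abstract relies on, a relative WER reduction and a mask applied in the log-fbank domain, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the mask floor, and the example WER values are hypothetical and are not taken from the paper, and the masking rule assumes a multiplicative mask on linear filterbank energies, which becomes an additive term after the logarithm.

```python
import math


def relative_wer_reduction(baseline_wer, enhanced_wer):
    """Relative WER reduction in percent: 100 * (baseline - enhanced) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - enhanced_wer) / baseline_wer


def apply_mask_log_fbank(noisy_log_fbank, mask, floor=1e-3):
    """Apply a T-F mask to log-fbank features (frames x filterbank channels).

    Assumption: the mask multiplies linear filterbank energies, so in the
    log domain it is added as log(mask). The floor is a hypothetical
    safeguard that keeps log(0) from occurring.
    """
    return [
        [x + math.log(max(m, floor)) for x, m in zip(frame, mask_frame)]
        for frame, mask_frame in zip(noisy_log_fbank, mask)
    ]


# Hypothetical numbers, not results from the paper.
print(relative_wer_reduction(20.0, 15.0))

noisy = [[2.0, 1.0], [0.5, 3.0]]       # two frames, two channels
mask = [[1.0, 0.5], [0.25, 1.0]]       # mask values in (0, 1]
print(apply_mask_log_fbank(noisy, mask))
```

A mask value of 1 leaves a channel untouched, while smaller values attenuate it. The enhanced features could then be passed to an unmodified recognizer, which is the setting the abstract studies.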