A Cross-Entropy-Guided Measure (CEGM) for Assessing Speech Recognition Performance and Optimizing DNN-Based Speech Enhancement

Li Chai,Jun Du,Qing-Feng Liu,Chin-Hui Lee
DOI: https://doi.org/10.1109/taslp.2020.3036783
2021-01-01
Abstract:A new cross-entropy-guided measure (CEGM) is proposed to indirectly assess accuracies of automatic speech recognition (ASR) of degraded speech with a speech enhancement front-end and without directly performing ASR experiments. The proposed CEGM is calculated in three steps, namely: (1) a low-level representations via feature extraction, (2) a high-level nonlinear mapping using an acoustic model, and (3) a final CEGM calculation between the high-level representations of clean and enhanced speech. Specifically, state posterior probabilities from outputs of conventional hybrid acoustic model of the target ASR system are adopted as the high-level representations and a cross-entropy criterion is used to calculate the CEGM. Due to CEGM's differentiability, it can also be used to replace the conventional minimum mean squared error (MMSE) criterion as an objective function for deep neural network (DNN)-based speech enhancement. Therefore, the front-end enhancement model can be optimized towards improving the accuracies of the back-end ASR system. Experiments on single-channel CHiME-4 Challenge show that CEGM yields consistently the highest correlations with word error rate (WER) which is often costly to calculate, and achieves the most accurate assessment of ASR performance when compared to the perceptual evaluation metrics commonly used for assessing speech enhancement performance. Furthermore, CEGM-optimized speech enhancement could effectively reduce the WER on the CHiME-4 real test set when compared to unprocessed noisy speech and enhanced speech obtained with MMSE-optimized enhancement for ASR systems with fixed multi-condition acoustic models in various deep architectures.
What problem does this paper attempt to address?