A Cross-Entropy-Guided (CEG) Measure for Speech Enhancement Front-End Assessing Performances of Back-End Automatic Speech Recognition

Li Chai, Jun Du, Chin-Hui Lee
DOI: https://doi.org/10.21437/Interspeech.2019-2511
2019-01-01
Abstract: One challenging problem in robust automatic speech recognition (ASR) is how to measure the quality of a speech enhancement algorithm without calculating the word error rate (WER), which is costly because it requires manual transcriptions, language modeling, and decoding. In this study, a novel cross-entropy-guided (CEG) measure is proposed for assessing whether the enhanced speech produced by a speech enhancement algorithm will yield good performance in robust ASR. CEG is computed in three consecutive steps: low-level representations are obtained via feature extraction, high-level representations are obtained via the nonlinear mapping of the acoustic model, and CEG is finally calculated between the high-level representations of clean and enhanced speech. Specifically, the state posterior probabilities at the output of the acoustic model's neural network are adopted as the high-level representations, and a cross-entropy criterion is used to calculate CEG. Experimental results show that CEG consistently yields the highest correlations with WER and gives the most accurate assessment of ASR performance when compared with distortion measures based on human auditory perception and with an acoustic confidence measure. Potentially, CEG could be adopted to guide the parameter optimization of deep-learning-based speech enhancement algorithms to further improve ASR performance.
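The final step described in the abstract, a cross-entropy between the state posteriors of clean and enhanced speech, can be illustrated with a minimal sketch. The abstract does not give the exact formula, so several things here are assumptions: the clean-speech posteriors are treated as the reference distribution, the per-frame cross-entropies are averaged over frames, and the function names `softmax` and `cross_entropy_measure` are hypothetical. Random logits stand in for the output of a real acoustic model.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax turning acoustic-model logits into state posteriors."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def cross_entropy_measure(clean_post, enh_post, eps=1e-12):
    """Frame-averaged cross-entropy between two posterior sequences.

    Both inputs have shape (num_frames, num_states) and rows that sum to 1.
    Assumption: clean posteriors are the reference, enhanced posteriors are
    the distribution being scored; the source abstract does not fix this.
    """
    clean_post = np.asarray(clean_post, dtype=np.float64)
    enh_post = np.asarray(enh_post, dtype=np.float64)
    if clean_post.shape != enh_post.shape:
        raise ValueError("clean and enhanced posteriors must share a shape")
    # Clip to avoid log(0) when the enhanced posterior assigns zero mass.
    log_enh = np.log(np.clip(enh_post, eps, 1.0))
    per_frame = -np.sum(clean_post * log_enh, axis=1)
    return float(per_frame.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-ins for acoustic-model outputs: 100 frames, 50 states.
    clean_logits = rng.normal(size=(100, 50))
    mild = softmax(clean_logits + 0.1 * rng.normal(size=(100, 50)))
    heavy = softmax(clean_logits + 2.0 * rng.normal(size=(100, 50)))
    clean = softmax(clean_logits)
    print("mild distortion :", cross_entropy_measure(clean, mild))
    print("heavy distortion:", cross_entropy_measure(clean, heavy))
```

Under these assumptions a lower value means the enhanced speech drives the acoustic model toward the same state posteriors as the clean speech, which is the property the abstract relates to WER. No transcription, language model, or decoding pass is needed to compute it.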