Joint Training Of Front-End And Back-End Deep Neural Networks For Robust Speech Recognition

Tian Gao,Jun Du,Li-Rong Dai,Chin-Hui Lee
DOI: https://doi.org/10.1109/ICASSP.2015.7178797
2015-01-01
Abstract:Based on the recently proposed speech pre-processing front-end with deep neural networks (DNNs), we first investigate different feature mapping directly from noisy speech via DNN for robust speech recognition. Next, we propose to jointly train a single DNN for both feature mapping and acoustic modeling. In the end, we show that the word error rate (WER) of the jointly trained system could be significantly reduced by the fusion of multiple DNN pre-processing systems which implies that features obtained from different domains of the DNN-enhanced speech signals are strongly complementary. Testing on the Aurora4 noisy speech recognition task our best system with multi-condition training can achieves an average WER of 10.3%, yielding a relative reduction of 16.3% over our previous DNN pre-processing only system with a WER of 12.3%. To the best of our knowledge, this represents the best published result on the Aurora4 task without using any adaptation techniques.
What problem does this paper attempt to address?