Investigation of deep neural networks (DNN) for large vocabulary continuous speech recognition: Why DNN surpasses GMMS in acoustic modeling

Jia Pan,Cong Liu,Zhiguo Wang,Yu Hu,Hui Jiang
DOI: https://doi.org/10.1109/iscslp.2012.6423452
2012-12-01
Abstract:Recently, it has been reported that context-dependent deep neural network (DNN) has achieved some unprecedented gains in many challenging ASR tasks, including the well-known Switchboard task. In this paper, we first investigate DNN for several large vocabulary speech recognition tasks. Our results have confirmed that DNN can consistently achieve about 25–30% relative error reduction over the best discriminatively trained GMMs even in some ASR tasks with up to 700 hours of training data. Next, we have conducted a series of experiments to study where the unprecedented gain of DNN comes from. Our experiments show the gain of DNN is almost entirely attributed to DNN's feature vectors that are concatenated from several consecutive speech frames within a relatively long context window. At last, we have proposed a few ideas to reconfigure the DNN input features, such as using logarithm spectrum features or VTLN normalized features in DNN. Our results have shown that each of these methods yields over 3% relative error reduction over the traditional MFCC or PLP features in DNN.
What problem does this paper attempt to address?