Modeling long temporal contexts for robust DNN-based speech recognition

Bo Li, K. Sim
DOI: https://doi.org/10.21437/Interspeech.2014-83
2014-01-01
Abstract: Deep Neural Networks (DNNs) have been shown to outperform traditional Gaussian Mixture Models in many Automatic Speech Recognition tasks. In this work, we investigate the potential of modeling long temporal acoustic contexts using DNNs. The complete temporal context is split into several sub-contexts. Multiple sub-context DNNs, initialized with the same set of Restricted Boltzmann Machines, are fine-tuned independently, and their last hidden layer activations are combined to jointly predict the desired state posteriors through a single softmax output layer. In preliminary experiments on the Aurora2 multi-style training task, our proposed system models a 65-frame temporal window of speech signals and yields a 4.4% word error rate (WER), a 12.0% relative improvement over the best single DNN. With the local independence assumption, both training and testing of the sub-context DNNs can be done in parallel. Moreover, our system has 48.2% fewer parameters than a single DNN with the same number of hidden units.
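The combination scheme described in the abstract can be illustrated with a minimal forward-pass sketch. Only the 65-frame window comes from the abstract; the number of sub-contexts (5), the non-overlapping equal-length split, the layer sizes, and all function names are hypothetical choices made for illustration. Random weights copied into every sub-context DNN stand in for the shared Restricted Boltzmann Machine initialization, and no pre-training or fine-tuning is performed, so this is not the authors' implementation.

```python
import copy
import math
import random


def split_context(window, num_sub):
    """Split a list of frames into num_sub contiguous, equal-length sub-contexts."""
    if len(window) % num_sub != 0:
        raise ValueError("window length must be divisible by num_sub")
    size = len(window) // num_sub
    return [window[i * size:(i + 1) * size] for i in range(num_sub)]


def init_layer(rng, n_in, n_out):
    """Random weight matrix (n_out rows of n_in weights) plus zero biases."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def affine(layer, x):
    """Compute W x + b for one layer."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def sigmoid(z):
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(window, sub_dnns, output_layer):
    """Return state posteriors for one window of frames."""
    combined = []
    for frames, layers in zip(split_context(window, len(sub_dnns)), sub_dnns):
        h = [v for frame in frames for v in frame]  # flatten the sub-context
        for layer in layers:
            h = sigmoid(affine(layer, h))
        combined.extend(h)  # keep only the last hidden layer activations
    # A single softmax layer sees the concatenated activations of all sub-DNNs.
    return softmax(affine(output_layer, combined))


# Toy sizes: everything except the 65-frame window is an assumption.
NUM_FRAMES, FEAT_DIM = 65, 3
NUM_SUB, HIDDEN, NUM_STATES = 5, 8, 4

rng = random.Random(0)
sub_in = (NUM_FRAMES // NUM_SUB) * FEAT_DIM

# One shared initialization copied into every sub-context DNN
# (a stand-in for the shared RBM initialization; these are random weights).
shared_init = [init_layer(rng, sub_in, HIDDEN), init_layer(rng, HIDDEN, HIDDEN)]
sub_dnns = [copy.deepcopy(shared_init) for _ in range(NUM_SUB)]
output_layer = init_layer(rng, NUM_SUB * HIDDEN, NUM_STATES)

window = [[rng.gauss(0.0, 1.0) for _ in range(FEAT_DIM)] for _ in range(NUM_FRAMES)]
posteriors = forward(window, sub_dnns, output_layer)
```

Because each sub-context DNN reads only its own slice of the window, the loop body in `forward` has no cross-iteration dependency, which is what would allow the sub-networks to be evaluated in parallel.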
Computer Science