Context-dependent Deep Neural Networks for audio indexing of real-life data

Gang Li, Huifeng Zhu, Gong Cheng, Kit Thambiratnam, Behrooz Chitsaz, Dong Yu, Frank Seide
DOI: https://doi.org/10.1109/SLT.2012.6424212
2012-01-01
Abstract: We apply Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, to the real-life problem of audio indexing of data across various sources. We recently showed that, on the Switchboard benchmark for speaker-independent transcription of phone calls, CD-DNN-HMMs with 7 hidden layers reduce the word error rate (WER) by as much as one-third compared to discriminatively trained Gaussian-mixture HMMs (GMM-HMMs), and by one-fourth if the GMM-HMM also uses fMPE features. This paper takes CD-DNN-HMM based recognition into a real-life deployment for audio indexing. We find that for our best speaker-independent CD-DNN-HMM, with 32k senones trained on 2000h of data, the one-fourth reduction does carry over to inhomogeneous field data (video podcasts and talks). Compared to a speaker-adaptive GMM system, the relative improvement is 18%, at very similar end-to-end runtime. In system building, we find that DNNs can benefit from a larger number of senones than the GMM-HMM, and that DNN likelihood evaluation is a sizeable runtime factor even in our wide-beam context of generating rich lattices: cutting the model size by 60% reduces runtime by one-third at a 5% relative WER loss.
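
The figures quoted in the abstract (one-third, one-fourth, 18%, 5%) are relative changes in word error rate, not absolute percentage points. A minimal sketch of that arithmetic follows; the function name `relative_wer_reduction` and the WER values are hypothetical, chosen only to illustrate the calculation, and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction of new_wer versus baseline_wer.

    A return value of 0.25 means a one-fourth reduction; a negative
    value means the new system is worse (a relative WER loss).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs in percent (illustrative only, not from the paper).
baseline = 24.0
improved = 18.0
print(f"{relative_wer_reduction(baseline, improved):.0%}")  # prints "25%"
```

Under this convention, a "5% relative WER loss" corresponds to a return value of -0.05, for example a WER rising from 20.0% to 21.0%.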