Error Back Propagation for Sequence Training of Context-Dependent Deep Networks for Conversational Speech Transcription

Hang Su, Gang Li, Dong Yu, Frank Seide
DOI: https://doi.org/10.1109/icassp.2013.6638951
2013-01-01
Abstract: We investigate back-propagation-based sequence training of Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, for conversational speech transcription. Theoretically, sequence training integrates with back-propagation in a straightforward manner. However, we find that to get reasonable results, heuristics are needed that point to a problem with lattice sparseness: the model must be adjusted to the updated numerator lattices by additional iterations of frame-based cross-entropy (CE) training; and to avoid distortions from “runaway” models, we can either add artificial silence arcs to the denominator lattices, or smooth the sequence objective with the frame-based one (F-smoothing). With the 309h Switchboard training set, the MMI objective achieves a relative word-error rate reduction of 11-15% over CE for matched test sets, and 10-17% for mismatched ones. This includes gains of 4-7% from realigned CE iterations. The BMMI and sMBR objectives gain less. With 2000h of data, gains are 2-9% after realigned CE iterations. Using GPGPUs, MMI is about 70% slower than CE training.
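The F-smoothing idea mentioned in the abstract (smoothing the sequence objective with the frame-based one) can be sketched as a linear interpolation of the two error signals. This is only an illustrative sketch: the function name `f_smooth`, the parameter `seq_weight`, and the convex-combination form are assumptions made here, and the abstract does not state the interpolation weight or the exact formulation the authors use.

```python
def f_smooth(ce_grad, seq_grad, seq_weight):
    """Interpolate frame-level (CE) and sequence-level error signals.

    Hypothetical helper: assumes F-smoothing is a convex combination
    (1 - w) * CE + w * SEQ with w = seq_weight. The weight value is not
    given in the abstract and must be chosen by the user.
    """
    if not 0.0 <= seq_weight <= 1.0:
        raise ValueError("seq_weight must lie in [0, 1]")
    if len(ce_grad) != len(seq_grad):
        raise ValueError("error signals must have the same length")
    # Element-wise interpolation over the output units of one frame.
    return [
        (1.0 - seq_weight) * c + seq_weight * s
        for c, s in zip(ce_grad, seq_grad)
    ]


# Example: two output units, one frame, arbitrary illustrative values.
smoothed = f_smooth([1.0, 0.0], [0.0, 1.0], seq_weight=0.25)
```

Setting `seq_weight` to 0 recovers pure frame-based CE training and setting it to 1 recovers the unsmoothed sequence objective, so the weight controls how strongly the frame-level term anchors the model.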