Language Modeling with Highway LSTM

Gakuto Kurata, Bhuvana Ramabhadran, George Saon, Abhinav Sethy
DOI: https://doi.org/10.48550/arXiv.1709.06436
2017-09-19
Abstract: Language models (LMs) based on Long Short Term Memory (LSTM) have shown good gains in many automatic speech recognition tasks. In this paper, we extend an LSTM by adding highway networks inside an LSTM and use the resulting Highway LSTM (HW-LSTM) model for language modeling. The added highway networks increase the depth in the time dimension. Since a typical LSTM has two internal states, a memory cell and a hidden state, we compare various types of HW-LSTM by adding highway networks onto the memory cell and/or the hidden state. Experimental results on English broadcast news and conversational telephone speech recognition show that the proposed HW-LSTM LM improves speech recognition accuracy on top of a strong LSTM LM baseline. We report 5.1% and 9.9% on the Switchboard and CallHome subsets of the Hub5 2000 evaluation, which reaches the best performance numbers reported on these tasks to date.
Computation and Language
What problem does this paper attempt to address?
This paper aims to improve Long Short-Term Memory (LSTM) language models by adding highway networks inside the LSTM, and thereby to improve the accuracy of automatic speech recognition (ASR). Specifically, the authors propose three Highway LSTM (HW-LSTM) variants: HW-LSTM-C, HW-LSTM-H, and HW-LSTM-CH. These add highway networks to the LSTM's memory cell, to its hidden state, or to both, respectively.
The main contributions of the paper are:
1. A new language modeling technique based on HW-LSTM.
2. A method for training HW-LSTM language models: a regular LSTM is pre-trained first, then converted into an HW-LSTM by adding highway connections, and training continues.
3. An evaluation of this method on broadcast news and conversational telephone speech recognition tasks using public data sets, including the best accuracy reported on the Switchboard and CallHome subsets at the time of publication.
The experimental results show that the HW-LSTM-H variant gives the largest reduction in Word Error Rate (WER), and that deeper highway networks improve performance further. The study also finds that the regular LSTM and the HW-LSTM are complementary: combining the two reduces the WER further.
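To make the three variants concrete, here is a minimal pure-Python sketch of a highway layer and of how it might be applied to an LSTM's states. It uses the standard highway formulation, y = T(x) * H(x) + (1 - T(x)) * x. Everything else is an assumption for illustration and is not taken from the paper: the function names (`highway`, `highway_stack`, `hw_lstm_states`, `make_layer`), the tanh transform, the point at which the highway stack touches the states, and the strongly negative gate bias. That bias makes a freshly added highway layer start out close to the identity, which is one plausible way to convert a pre-trained LSTM without disturbing it; the paper's actual initialization is not stated in this summary.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def highway(x, W_h, b_h, W_t, b_t):
    """One highway layer: y = T * H(x) + (1 - T) * x."""
    h = [math.tanh(a + b) for a, b in zip(matvec(W_h, x), b_h)]  # transform H(x)
    t = [sigmoid(a + b) for a, b in zip(matvec(W_t, x), b_t)]    # gate T(x)
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]


def highway_stack(state, layers):
    """Apply several highway layers in sequence (a 'deep' highway network)."""
    for W_h, b_h, W_t, b_t in layers:
        state = highway(state, W_h, b_h, W_t, b_t)
    return state


def hw_lstm_states(c, h, variant, layers_c=(), layers_h=()):
    """Post-process LSTM states for one time step.

    variant "C" transforms the memory cell, "H" the hidden state, "CH" both.
    Assumption: the highway stack is applied to the states produced by an
    ordinary LSTM step; the exact placement in the paper may differ.
    """
    if variant not in ("C", "H", "CH"):
        raise ValueError("variant must be 'C', 'H', or 'CH'")
    if "C" in variant:
        c = highway_stack(c, layers_c)
    if "H" in variant:
        h = highway_stack(h, layers_h)
    return c, h


def make_layer(dim, gate_bias=-20.0, seed=0):
    """Hypothetical initializer: small random transform, gate nearly closed.

    With a strongly negative gate bias, T(x) is close to 0, so the layer
    initially passes its input through almost unchanged.
    """
    rng = random.Random(seed)
    W_h = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]
    W_t = [[0.0] * dim for _ in range(dim)]
    return W_h, [0.0] * dim, W_t, [gate_bias] * dim
```

For example, `hw_lstm_states(c, h, "H", layers_h=[make_layer(3), make_layer(3)])` leaves `c` untouched and returns an `h` that is nearly identical to the input until training moves the gates away from their initial values.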