Persistence pays off: Paying Attention to What the LSTM Gating Mechanism Persists

Giancarlo D. Salton, John D. Kelleher
DOI: https://doi.org/10.48550/arXiv.1810.04437
2018-10-10
Abstract: Language Models (LMs) are important components in several Natural Language Processing systems. Recurrent Neural Network LMs composed of LSTM units, especially those augmented with an external memory, have achieved state-of-the-art results. However, these models still struggle to process long sequences which are more likely to contain long-distance dependencies because of information fading and a bias towards more recent information. In this paper we demonstrate an effective mechanism for retrieving information in a memory augmented LSTM LM based on attending to information in memory in proportion to the number of timesteps the LSTM gating mechanism persisted the information.
Machine Learning
What problem does this paper attempt to address?
### The problems the paper attempts to solve

This paper addresses information attenuation when language models (LMs) process long sequences. Recurrent neural network language models (RNN-LMs) built from LSTM units perform well on many natural language processing tasks, but they still struggle with long sequences. The summary identifies two challenges:

1. **Information attenuation**: As the sequence grows longer, early information gradually fades, which matters most when the sequence contains long-distance dependencies (LDDs).
2. **Bias towards recent information**: The LSTM hidden state can be dominated by the most recent inputs, so important early information is ignored.

To address these problems, the authors improve the memory-augmented LSTM LM by attending to the information that the LSTM gating mechanism retains. The method is simple: information in memory is weighted in proportion to the number of timesteps for which the gating mechanism persisted it.

### Overview of the solution

The main contribution is an effective mechanism for retrieving historical information in a memory-augmented LSTM language model. It has two parts:

- **Weighting persistent information**: Information that the gating mechanism keeps across several timesteps receives proportionally more weight, on the assumption that persistence reflects importance. This reinforces the decisions the gates make at each timestep about which parts of the sequence matter.
- **Simplifying the attention mechanism**: Unlike earlier methods that use an additional neural network to predict which elements to retrieve from the memory buffer, the authors use a simple average that builds the history representation directly from the decisions of the LSTM gating mechanism.
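The idea that a plain average weights information by how long it was persisted can be checked on a toy memory buffer. The sketch below is an illustration under stated assumptions, not the authors' code: the helper name `history_average` and the values in `memory` are hypothetical, and each row stands in for one stored hidden state.

```python
import numpy as np

def history_average(hidden_states):
    """Mean of the stored hidden states, one row per timestep."""
    hidden_states = np.asarray(hidden_states, dtype=float)
    return hidden_states.mean(axis=0)

# Hypothetical buffer of 4 past hidden states with 2 units each.
# Unit 0 carries the value 1.0 for three timesteps (the gates kept it),
# unit 1 carries it for a single timestep.
memory = np.array([
    [1.0, 0.0],
    [1.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
])

context = history_average(memory)
print(context)  # [0.75 0.25]
```

The value that survived three of the four timesteps contributes three times as much to the history representation as the value that survived one, without any learned attention weights.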
With this approach, the authors report results close to or matching the state of the art on the Penn Treebank and WikiText-2 datasets, with fewer parameters.

### Summary of the main formulas

1. **LSTM unit** (\(\sigma\) is the logistic sigmoid and \(\odot\) is elementwise multiplication):

\[ \tilde{c}_t = \tanh(W_x x_t + W_h h_{t-1} + b) \]

\[ i_t = \sigma(W_{ii} x_t + W_{ih} h_{t-1} + b_i) \]

\[ f_t = \sigma(W_{if} x_t + W_{fh} h_{t-1} + b_f) \]

\[ o_t = \sigma(W_{io} x_t + W_{oh} h_{t-1} + b_o) \]

\[ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \]

\[ h_t = o_t \odot \tanh(c_t) \]

2. **History representation as an average of past hidden states**:

\[ c_t = \frac{1}{t} \sum_{i=0}^{t-1} h_i \]

Here \(c_t\) is the history (context) vector built from the memory buffer. The summary reuses the symbol of the LSTM cell state above, but the two are different quantities.

3. **Final prediction**:

\[ p(w_t \mid w_{<t}, x) = \text{softmax}(W_s h'_t + b) \]

where \([h_t; c_t]\) is the concatenation of the current hidden state and the history vector, and

\[ h'_t = \tanh(W_c [h_t; c_t] + b_t) \]

### Conclusion

The authors show that the decisions of the LSTM gating mechanism can be used to weight persisted information, and that doing so helps when processing long sequences. The method simplifies the model structure and is reported to improve performance, especially on long-distance dependencies.
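The formulas summarized above can be traced end to end in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the dictionary `p`, the layer sizes, the random weights, and the choice to seed the buffer with an all-zero \(h_0\) are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM update following the gate equations in the summary."""
    c_tilde = np.tanh(p["W_x"] @ x_t + p["W_h"] @ h_prev + p["b"])
    i_t = sigmoid(p["W_ii"] @ x_t + p["W_ih"] @ h_prev + p["b_i"])
    f_t = sigmoid(p["W_if"] @ x_t + p["W_fh"] @ h_prev + p["b_f"])
    o_t = sigmoid(p["W_io"] @ x_t + p["W_oh"] @ h_prev + p["b_o"])
    c_t = f_t * c_prev + i_t * c_tilde  # elementwise products
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

def predict(h_t, memory, W_c, b_c, W_s, b_s):
    """Combine h_t with the averaged history; return next-word probabilities."""
    context = np.mean(memory, axis=0)  # (1/t) * sum of h_0 .. h_{t-1}
    h_att = np.tanh(W_c @ np.concatenate([h_t, context]) + b_c)
    return softmax(W_s @ h_att + b_s)

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
d_in, d_h, vocab = 3, 4, 5
p = {k: rng.normal(scale=0.1, size=(d_h, d_in)) for k in ("W_x", "W_ii", "W_if", "W_io")}
p.update({k: rng.normal(scale=0.1, size=(d_h, d_h)) for k in ("W_h", "W_ih", "W_fh", "W_oh")})
p.update({k: np.zeros(d_h) for k in ("b", "b_i", "b_f", "b_o")})
W_c, b_c = rng.normal(scale=0.1, size=(d_h, 2 * d_h)), np.zeros(d_h)
W_s, b_s = rng.normal(scale=0.1, size=(vocab, d_h)), np.zeros(vocab)

h, c = np.zeros(d_h), np.zeros(d_h)
memory = [h]  # assumed h_0; the buffer holds h_0 .. h_{t-1}
for x_t in rng.normal(size=(6, d_in)):
    h, c = lstm_step(x_t, h, c, p)
    probs = predict(h, memory, W_c, b_c, W_s, b_s)
    memory.append(h)  # store h_t only after it has been used

print(probs.shape, float(probs.sum()))
```

Appending `h` to `memory` after the prediction keeps the average restricted to states before the current timestep, matching the upper limit \(t-1\) of the sum.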