Abstract: Language Models (LMs) are important components in several Natural Language Processing systems. Recurrent Neural Network LMs composed of LSTM units, especially those augmented with an external memory, have achieved state-of-the-art results. However, these models still struggle to process long sequences which are more likely to contain long-distance dependencies because of information fading and a bias towards more recent information. In this paper we demonstrate an effective mechanism for retrieving information in a memory augmented LSTM LM based on attending to information in memory in proportion to the number of timesteps the LSTM gating mechanism persisted the information.

What problem does this paper attempt to address?

### The problems the paper attempts to solve

This paper addresses information fading in long-sequence processing by language models (LMs). Recurrent neural network language models (RNN-LMs) built from LSTM units perform well on many natural language processing tasks, but they still struggle with long sequences. Two challenges stand out:

1. **Information fading**: As the sequence grows, early information gradually fades, which hurts most when the sequence contains long-distance dependencies (LDDs).
2. **Bias towards recent information**: The LSTM hidden state may be dominated by the most recent inputs, so important early information is ignored.

To address these problems, the authors improve retrieval in a memory-augmented LSTM LM by attending to the information that the LSTM gating mechanism retains. Information in memory is weighted according to the number of time steps over which the gating mechanism persisted it.

### Overview of the solution

The authors' main contribution is an effective mechanism for retrieving historical information from the memory of a memory-augmented LSTM language model:

- **Weighting persistent information**: Information that the LSTM gating mechanism retains over multiple time steps is weighted to reflect its importance. This reinforces the decisions the gating mechanism makes at each time step about which parts of the sequence matter.
- **Simplifying the attention mechanism**: Unlike previous methods, which use additional neural networks to predict which elements to retrieve from the memory buffer, the authors use a simple averaging method that builds the history representation directly from the decisions of the LSTM gating mechanism.

With this method, the authors report results close to or matching the state of the art on the Penn Treebank and WikiText-2 datasets, with fewer parameters.

### Summary of the main formulas

1. **Basic calculation formulas of LSTM units**:

\[ \tilde{c}_t = \tanh(W_x x_t + W_h h_{t-1} + b) \]
\[ i_t = \sigma(W_{ii} x_t + W_{hi} h_{t-1} + b_i) \]
\[ f_t = \sigma(W_{if} x_t + W_{hf} h_{t-1} + b_f) \]
\[ o_t = \sigma(W_{io} x_t + W_{ho} h_{t-1} + b_o) \]
\[ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \]
\[ h_t = o_t \odot \tanh(c_t) \]

2. **Average calculation of historical representation** (in this and the next formula, \(c_t\) denotes the history, or context, vector rather than the LSTM cell state above):

\[ c_t = \frac{1}{t} \sum_{i=0}^{t-1} h_i \]

3. **Calculation of final prediction**:

\[ p(w_t \mid w_{<t}, x) = \text{softmax}(W_s h'_t + b) \]

where

\[ h'_t = \tanh(W_c [h_t; c_t] + b_t) \]

### Conclusion

The authors demonstrate that using the decisions of the LSTM gating mechanism to weight persistent information is effective, and they show its advantages on long sequences. The method simplifies the model structure and also improves performance, especially on long-distance dependencies.
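The three formulas summarized above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. It covers only the plain average of previous hidden states given in formula 2, and not the persistence-based weighting, whose exact form the summary does not give. The names `init_params`, `lstm_step`, `run_lm`, `W_att`, and `b_att` are hypothetical. The input and recurrent weights of each gate are stacked into one matrix applied to `[x_t; h_{t-1}]`, which is equivalent to using two separate matrices. The zero context vector at the first step, where the history is empty, is also an assumption.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()


def init_params(input_size, hidden_size, vocab_size, seed=0):
    """Small random weights; all names here are illustrative, not from the paper."""
    rng = np.random.default_rng(seed)

    def mat(rows, cols):
        return rng.normal(0.0, 0.1, size=(rows, cols))

    joint = input_size + hidden_size  # size of the stacked vector [x_t; h_{t-1}]
    return {
        "W_i": mat(hidden_size, joint), "b_i": np.zeros(hidden_size),
        "W_f": mat(hidden_size, joint), "b_f": np.zeros(hidden_size),
        "W_o": mat(hidden_size, joint), "b_o": np.zeros(hidden_size),
        "W_c": mat(hidden_size, joint), "b_c": np.zeros(hidden_size),
        # Combines [h_t; context_t] into h'_t (W_c in formula 3 of the summary).
        "W_att": mat(hidden_size, 2 * hidden_size), "b_att": np.zeros(hidden_size),
        "W_s": mat(vocab_size, hidden_size), "b_s": np.zeros(vocab_size),
    }


def lstm_step(x_t, h_prev, cell_prev, p):
    """One LSTM step following the gate equations in formula 1."""
    z = np.concatenate([x_t, h_prev])
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])
    f_t = sigmoid(p["W_f"] @ z + p["b_f"])
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])
    cell_tilde = np.tanh(p["W_c"] @ z + p["b_c"])
    cell_t = f_t * cell_prev + i_t * cell_tilde
    h_t = o_t * np.tanh(cell_t)
    return h_t, cell_t


def run_lm(xs, p):
    """Run the LM over a sequence of input vectors xs with shape (T, input_size)."""
    hidden_size = p["b_i"].shape[0]
    h = np.zeros(hidden_size)
    cell = np.zeros(hidden_size)
    history_sum = np.zeros(hidden_size)  # running sum of h_0 .. h_{t-1}
    probs, contexts, hiddens = [], [], []
    for t, x_t in enumerate(xs):
        # Formula 2: mean of all previous hidden states (assumed zero at t = 0).
        context = history_sum / t if t > 0 else np.zeros(hidden_size)
        h, cell = lstm_step(x_t, h, cell, p)
        # Formula 3: combine current state and history, then predict.
        h_att = np.tanh(p["W_att"] @ np.concatenate([h, context]) + p["b_att"])
        probs.append(softmax(p["W_s"] @ h_att + p["b_s"]))
        contexts.append(context)
        hiddens.append(h)
        history_sum = history_sum + h
    return np.stack(probs), np.stack(contexts), np.stack(hiddens)


if __name__ == "__main__":
    params = init_params(input_size=4, hidden_size=3, vocab_size=5)
    inputs = np.random.default_rng(1).normal(size=(6, 4))
    probs, contexts, hiddens = run_lm(inputs, params)
    print(probs.shape, contexts.shape, hiddens.shape)
```

Keeping a running sum means the history vector costs a constant amount of work per step and needs no extra trainable network to decide what to retrieve, which is the simplification the summary emphasizes.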

Persistence pays off: Paying Attention to What the LSTM Gating Mechanism Persists
