Abstract: We present a complementary objective for training recurrent neural networks (RNNs) with gating units that helps with regularization and interpretability of the trained model. Attention-based RNN models have shown success in many difficult sequence-to-sequence classification problems with long- and short-term dependencies; however, these models are prone to overfitting. In this paper, we describe how to regularize these models through an L1 penalty on the activation of the gating units, and show that this technique reduces overfitting on a variety of tasks while also providing a human-interpretable visualization of the inputs used by the network. These tasks include sentiment analysis, paraphrase recognition, and question answering.

What problem does this paper attempt to address?

The paper addresses two problems: improving the generalization of recurrent neural networks (RNNs) on sequence classification tasks, and making the trained models more interpretable. Specifically, the authors show how sparsity constraints can reduce overfitting and provide a visual explanation of which inputs the network uses.

### Problem Background

Attention-based RNNs have achieved success in many sequence-to-sequence classification problems involving long- and short-term dependencies, such as sentiment analysis, paraphrase identification, and question answering. However, these models are prone to overfitting, especially on complex tasks.

### Solution

To address these problems, the authors propose a regularization method that forces the model to use input information selectively by imposing an L1 penalty on the activations of the gating units. This not only helps prevent overfitting but also makes the model's input selection interpretable to humans.

### Specific Methods

1. **Sparsity Penalty**: A sparsity penalty term is added to the original training objective \( J \):

   \[ J^* = J + \lambda_{\text{sparse}} \cdot \sum_i g_i \]

   where \( g_i \) is the activation value of gating unit \( i \), and \( \lambda_{\text{sparse}} \) is a hyperparameter that controls the strength of the sparsity penalty.

2. **Hierarchical Gated LSTM (HG-LSTM)**: The authors also introduce a hierarchically gated LSTM, which can selectively ignore or include information at different levels of abstraction. This structure includes two sub-models, the Fact model and the High-Level model, which handle low-level and high-level information respectively.

3. **Gradually Changing Sparsity Penalty**: To avoid excessive sparsity early in training, the weight \( \lambda_{\text{sparse}} \) of the sparsity penalty is increased gradually as training progresses.

### Experimental Results

The authors evaluate the method on three tasks:

- **Sentiment Analysis**: Experiments on the Stanford Sentiment Treebank dataset show that introducing the sparsity penalty significantly improves the model's performance.
- **Paraphrase Identification**: Experiments on the SemEval 2014 dataset show that the method improves the model's recall.
- **Question Answering**: Experiments on the Facebook bAbI dataset show that the HG-LSTM structure outperforms the traditional LSTM model on most tasks.

### Conclusion

By introducing sparsity penalties and hierarchically gated structures, the authors improve the generalization ability and interpretability of RNN models on multiple tasks. This provides a new direction for future research, especially in building more efficient and interpretable deep learning models.
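The sparsity penalty and its gradual schedule described under Specific Methods can be illustrated with a small sketch. This is not the authors' implementation: the function names `sparse_objective` and `lambda_schedule`, the linear ramp, and all numeric values are assumptions made for illustration. The summary only states that \( \lambda_{\text{sparse}} \) increases gradually during training, without giving the shape of the schedule.

```python
import numpy as np


def sigmoid(x):
    """Logistic function; gate activations are assumed to lie in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def sparse_objective(task_loss, gate_activations, lambda_sparse):
    """Compute J* = J + lambda_sparse * sum_i g_i.

    Because the gates are assumed non-negative, the plain sum equals
    the L1 norm of the gate activations.
    """
    return task_loss + lambda_sparse * float(np.sum(gate_activations))


def lambda_schedule(step, total_steps, lambda_max):
    """Hypothetical linear ramp of the penalty weight from 0 to lambda_max.

    The linear shape is an assumption; any monotonically increasing
    schedule would match the description in the text.
    """
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return lambda_max * fraction


if __name__ == "__main__":
    # Illustrative gate pre-activations and task loss (made-up values).
    gates = sigmoid(np.array([-2.0, 0.0, 3.0]))
    task_loss = 0.5

    for step in (0, 500, 1000):
        lam = lambda_schedule(step, total_steps=1000, lambda_max=0.01)
        total = sparse_objective(task_loss, gates, lam)
        print(f"step={step} lambda={lam:.4f} objective={total:.4f}")
```

With the weight at zero early on, the objective equals the task loss, so the gates are free to open. As the weight grows, open gates become increasingly costly, which pushes the model to keep only the inputs it needs.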

Occam's Gates

Refined Gate: A Simple and Effective Gating Mechanism for Recurrent Units

Gates Are Not What You Need in RNNs

Recurrent Neural Networks with Flexible Gates using Kernel Activation Functions

Hierarchically Gated Recurrent Neural Network for Sequence Modeling

Faster Training of Very Deep Networks Via p-Norm Gates

Gating creates slow modes and controls phase-space complexity in GRUs and LSTMs

Gated recurrent neural networks discover attention

Recurrent Attention Unit

Explaining Modern Gated-Linear RNNs via a Unified Implicit Attention Formulation

Minimal Gated Unit for Recurrent Neural Networks

Adding Attentiveness to the Neurons in Recurrent Neural Networks

Recurrent attention unit: A new gated recurrent unit for long-term memory of important parts in sequential data

Memory Visualization for Gated Recurrent Neural Networks in Speech Recognition

GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling

Highway State Gating for Recurrent Highway Networks: improving information flow through time

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

Representation of linguistic form and function in recurrent neural networks

Deep Gate Recurrent Neural Network

Semi-tied Units for Efficient Gating in LSTM and Highway Networks

EleAtt-RNN: Adding Attentiveness to Neurons in Recurrent Neural Networks