Occam's Gates

Jonathan Raiman,Szymon Sidor
DOI: https://doi.org/10.48550/arXiv.1506.08251
2015-06-27
Abstract:We present a complimentary objective for training recurrent neural networks (RNN) with gating units that helps with regularization and interpretability of the trained model. Attention-based RNN models have shown success in many difficult sequence to sequence classification problems with long and short term dependencies, however these models are prone to overfitting. In this paper, we describe how to regularize these models through an L1 penalty on the activation of the gating units, and show that this technique reduces overfitting on a variety of tasks while also providing to us a human-interpretable visualization of the inputs used by the network. These tasks include sentiment analysis, paraphrase recognition, and question answering.
Machine Learning
What problem does this paper attempt to address?
The main problems that this paper attempts to solve are improving the generalization ability of Recurrent Neural Networks (RNNs) in handling sequence classification tasks and the interpretability of the model. Specifically, the author focuses on how to reduce over - fitting by introducing sparsity constraints and provide visual explanations for input data. ### Problem Background Attention - based RNNs have achieved success in many sequence - to - sequence classification problems involving long - and short - term dependencies, such as sentiment analysis, paraphrase identification, and question - answering systems. However, these models are prone to over - fitting, especially when handling complex tasks. ### Solution To improve these problems, the author proposes a new regularization method, which forces the model to selectively use input information by imposing an L1 penalty on the activation of gating units. This method not only helps prevent over - fitting but also makes the model's input selection interpretable to humans. ### Specific Methods 1. **Introducing Sparsity Penalty**: Add a sparsity penalty term to the original training objective function \( J \): \[ J^* = J+\lambda_{\text{sparse}}\cdot\sum_i g_i \] where \( g_i \) is the activation value of the gating unit, and \( \lambda_{\text{sparse}} \) is a hyperparameter that controls the intensity of the sparsity penalty. 2. **Hierarchical Gated LSTM (HG - LSTM)**: The author also introduces a hierarchical - gated LSTM, which can selectively ignore or include information at different levels of abstraction. This structure includes two sub - models: the Fact model and the High - Level model, which are used to handle low - level and high - level information respectively. 3. **Gradually Changing Sparsity Penalty**: To avoid excessive sparsity in the early training stage, the author adopts a gradually changing sparsity penalty strategy, that is, gradually increasing the weight \( \lambda_{\text{sparse}} \) of the sparsity penalty as the training progresses. ### Experimental Results The author verified the effectiveness of this method on three different tasks: - **Sentiment Analysis**: Experiments on the Stanford Sentiment Treebank dataset show that after introducing the sparsity penalty, the performance of the model is significantly improved. - **Paraphrase Identification**: Experiments on the SemEval 2014 dataset show that this method improves the recall rate of the model. - **Question - answering Systems**: Experiments on the Facebook bAbI dataset show that the HG - LSTM structure is superior to the traditional LSTM model in most tasks and achieves better performance in some tasks. ### Conclusion By introducing sparsity penalties and hierarchical - gated structures, the author has successfully improved the generalization ability and interpretability of RNN models on multiple tasks. This provides a new direction for future research, especially in building more efficient and interpretable deep - learning models.