LSTM with Working Memory

Andrew Pulver, Siwei Lyu
DOI: https://doi.org/10.48550/arXiv.1605.01988
2017-03-31
Abstract: Previous RNN architectures have largely been superseded by LSTM, or "Long Short-Term Memory". Since its introduction, there have been many variations on this simple design. However, it is still widely used and we are not aware of a gated-RNN architecture that outperforms LSTM in a broad sense while still being as simple and efficient. In this paper we propose a modified LSTM-like architecture. Our architecture is still simple and achieves better performance on the tasks that we tested on. We also introduce a new RNN performance benchmark that uses the handwritten digits and stresses several important network capabilities.
Neural and Evolutionary Computing
What problem does this paper attempt to address?
This paper addresses several limitations of the existing LSTM (Long Short-Term Memory) architecture in order to improve its performance on sequence tasks. Specifically, it identifies three main problems:

1. **Exponential decay of memory caused by the forget gate**: The forget gate imposes an exponential decay on the memory unit, which may be inappropriate in some cases. For example, when the model needs to retain certain information for a long time, this decay may weaken it prematurely.
2. **Limited information exchange between memory units**: Memory units cannot exchange information directly unless the input and output gates are open. This restricts the flow of information among memory units and makes it harder for the model to manage complex internal states.
3. **Saturation of the hyperbolic tangent activation function**: LSTM uses the hyperbolic tangent ($\tanh$) as an activation function. When the input is large in magnitude, the gradient of $\tanh$ becomes very small, which contributes to the vanishing gradient problem and hampers training.

To address these problems, the authors propose a modified LSTM architecture called LSTM with Working Memory (LSTWM). Its main changes are:

- **Replacing the forget gate with a functional layer**: LSTWM introduces a functional layer located between the input gate and the output gate. It combines the current memory unit value with the output of this functional layer through a convex combination, instead of simply multiplying by the output of the forget gate.
- **Using a logarithm-based activation function**: LSTWM experiments with a logarithm-based activation function to avoid the saturation that traditional activation functions (such as $\tanh$) exhibit for large inputs.
- **Enhancing information exchange between memory units**: Through the additional functional layer, memory units can exchange information more flexibly, without depending on whether the input and output gates are open.

To verify the effectiveness of these changes, the authors ran experiments on several tasks, including text prediction and a task that combines digit recognition with addition. The results show that LSTWM performs better on these tasks, especially when using the logarithm-based activation function.

In summary, the main objective of this paper is to overcome the limitations of LSTM in handling long-term dependencies and complex sequence data by modifying the LSTM architecture, thereby improving the performance and efficiency of the model.
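The forget-gate decay, the convex-combination update, and the saturation argument can be illustrated numerically. The following is a minimal sketch, not the paper's implementation. The function names are hypothetical, the log activation is assumed to have the symmetric form sign(x)·ln(1 + |x|) (the summary does not give its exact definition), and the weighting used in `convex_update` is likewise an assumption.

```python
import math


def cell_after_constant_forget(c0: float, f: float, steps: int) -> float:
    """Memory value after `steps` updates c_t = f * c_{t-1} with no new input.

    With a constant forget-gate value f in (0, 1) this equals c0 * f**steps,
    i.e. the stored value decays exponentially.
    """
    c = c0
    for _ in range(steps):
        c = f * c
    return c


def convex_update(c_prev: float, candidate: float, g: float) -> float:
    """Convex combination of the old memory value and a candidate value.

    ASSUMPTION: g in [0, 1] weights the candidate and (1 - g) weights the old
    value; the summary only states that a convex combination is used.
    """
    return (1.0 - g) * c_prev + g * candidate


def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)**2, which vanishes for large |x|."""
    return 1.0 - math.tanh(x) ** 2


def log_act(x: float) -> float:
    """ASSUMED logarithm-based activation: sign(x) * ln(1 + |x|)."""
    return math.copysign(math.log1p(abs(x)), x)


def log_act_grad(x: float) -> float:
    """Derivative of the assumed log activation: 1 / (1 + |x|)."""
    return 1.0 / (1.0 + abs(x))


# Compare how quickly each gradient shrinks as the input grows.
for x in (0.0, 2.0, 10.0):
    print(f"x={x:5.1f}  tanh'={tanh_grad(x):.3e}  log'={log_act_grad(x):.3e}")

# Exponential decay under a constant forget-gate value of 0.9.
print(cell_after_constant_forget(1.0, 0.9, 50))
```

Under these assumptions, the gradient of the log activation shrinks only polynomially, as 1 / (1 + |x|), while the gradient of $\tanh$ shrinks exponentially. This is one way to read the saturation argument in the summary above.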