Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network

Alex Sherstinsky
DOI: https://doi.org/10.1016/j.physd.2019.132306
2023-07-31
Abstract: Because of their effectiveness in broad practical applications, LSTM networks have received a wealth of coverage in scientific journals, technical blogs, and implementation guides. However, in most articles, the inference formulas for the LSTM network and its parent, RNN, are stated axiomatically, while the training formulas are omitted altogether. In addition, the technique of "unrolling" an RNN is routinely presented without justification throughout the literature. The goal of this paper is to explain the essential RNN and LSTM fundamentals in a single document. Drawing from concepts in signal processing, we formally derive the canonical RNN formulation from differential equations. We then propose and prove a precise statement, which yields the RNN unrolling technique. We also review the difficulties with training the standard RNN and address them by transforming the RNN into the "Vanilla LSTM" network through a series of logical arguments. We provide all equations pertaining to the LSTM system together with detailed descriptions of its constituent entities. Albeit unconventional, our choice of notation and the method for presenting the LSTM system emphasizes ease of understanding. As part of the analysis, we identify new opportunities to enrich the LSTM system and incorporate these extensions into the Vanilla LSTM network, producing the most general LSTM variant to date. The target reader has already been exposed to RNNs and LSTM networks through numerous available resources and is open to an alternative pedagogical approach. A Machine Learning practitioner seeking guidance for implementing our new augmented LSTM model in software for experimentation and research will find the insights and derivations in this tutorial valuable as well.
Machine Learning
What problem does this paper attempt to address?
The paper primarily addresses the lack of thorough theoretical and conceptual explanations of Recurrent Neural Networks (RNNs) and their Long Short-Term Memory (LSTM) variant. Specifically, the goals of the paper are:

1. **Formal Derivation**: Formally derive the canonical RNN equations starting from differential equations, and prove a statement that justifies the RNN unrolling technique.
2. **Completeness and Generality**: Provide complete inference and training formulas covering all system components, focusing on the most general form of the LSTM system (i.e., "vanilla LSTM"), including the influence of the cell state on the control nodes (the so-called "peephole connections").
3. **Intuitive and Understandable**: Use descriptive, self-explanatory notation to explain the facts and fundamental principles, reducing confusion and misunderstanding.
4. **Modularity**: Describe the LSTM unit so that it can easily serve as part of a pluggable architecture, whether in the horizontal direction as "deep sequences" or in the vertical direction as "deep representations."
5. **Vector Representation**: Express the equations in matrix and vector form so that they can be transcribed directly into matrix software libraries such as numpy.

The paper first introduces the basic theoretical background of RNNs, then establishes the RNN unrolling technique rigorously through a mathematical proof. Next, motivated by the numerical difficulties encountered when training on long sequences, the paper constructs the vanilla LSTM unit step by step to improve on the robustness of the standard RNN. The paper also explains all aspects of the vanilla LSTM unit in detail and proposes an enhanced LSTM system built on it. Overall, this paper fills a gap: the lack of a comprehensive, self-contained introduction that explains the vanilla LSTM computational unit clearly and concisely, accompanied by clearly labeled unit diagrams and sequence diagrams.
Additionally, the paper offers suggestions for future projects.
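
To make the "vector representation" goal and the peephole connections concrete, the following is a minimal numpy sketch of one inference step of a peephole LSTM, applied over a short sequence. It uses the commonly cited formulation of the vanilla LSTM (input, forget, and output gates, each with a peephole from the cell state), not the paper's own notation or its augmented variant. All names here (`init_params`, `lstm_step`, `run_sequence`, and the `W_*`, `R_*`, `p_*`, `b_*` parameter keys) are illustrative assumptions, and training formulas are not covered.

```python
import numpy as np


def sigmoid(x):
    """Logistic function used for the three gates."""
    return 1.0 / (1.0 + np.exp(-x))


def init_params(n_in, n_hid, seed=0):
    """Small random parameters (hypothetical layout, for illustration only)."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in ("i", "f", "z", "o"):
        params["W_" + gate] = rng.normal(0.0, 0.1, (n_hid, n_in))   # input weights
        params["R_" + gate] = rng.normal(0.0, 0.1, (n_hid, n_hid))  # recurrent weights
        params["b_" + gate] = np.zeros(n_hid)                       # biases
    for gate in ("i", "f", "o"):
        params["p_" + gate] = rng.normal(0.0, 0.1, n_hid)           # peephole weights
    return params


def lstm_step(x, h_prev, c_prev, p):
    """One inference step of a peephole LSTM cell."""
    # Input and forget gates peek at the previous cell state.
    i = sigmoid(p["W_i"] @ x + p["R_i"] @ h_prev + p["p_i"] * c_prev + p["b_i"])
    f = sigmoid(p["W_f"] @ x + p["R_f"] @ h_prev + p["p_f"] * c_prev + p["b_f"])
    # Candidate update to the cell state.
    z = np.tanh(p["W_z"] @ x + p["R_z"] @ h_prev + p["b_z"])
    # New cell state: gated candidate plus gated previous state.
    c = i * z + f * c_prev
    # Output gate peeks at the updated cell state.
    o = sigmoid(p["W_o"] @ x + p["R_o"] @ h_prev + p["p_o"] * c + p["b_o"])
    h = o * np.tanh(c)
    return h, c


def run_sequence(xs, p, n_hid):
    """Apply the same cell to each time step, starting from zero state."""
    h = np.zeros(n_hid)
    c = np.zeros(n_hid)
    outputs = []
    for x in xs:
        h, c = lstm_step(x, h, c, p)
        outputs.append(h)
    return np.stack(outputs)


if __name__ == "__main__":
    params = init_params(n_in=3, n_hid=4)
    xs = np.ones((5, 3))  # toy sequence: 5 time steps, 3 features each
    hs = run_sequence(xs, params, n_hid=4)
    print(hs.shape)
```

The loop in `run_sequence` reuses one set of parameters at every time step, which is the sense in which an unrolled recurrent network shares weights across steps. Keeping the peephole weights as vectors (elementwise products) rather than full matrices is the usual convention, assumed here.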