Abstract: Numerous theories of learning propose preventing the gradient from growing exponentially with depth or time in order to stabilize and improve training. Typically, these analyses are conducted on feed-forward fully-connected neural networks or simple single-layer recurrent neural networks, given their mathematical tractability. In contrast, this study demonstrates that pre-training the network to local stability can be effective when architectures are too complex for an analytical initialization. Furthermore, we extend known stability theories to encompass a broader family of deep recurrent networks, requiring minimal assumptions on data and parameter distribution, a theory we call the Local Stability Condition (LSC). Our investigation reveals that the classical Glorot, He, and Orthogonal initialization schemes satisfy the LSC when applied to feed-forward fully-connected neural networks. However, when analysing deep recurrent networks, we identify a new additive source of exponential explosion that emerges from counting gradient paths in a rectangular grid in depth and time. We propose a new approach to mitigate this issue, which consists of giving a weight of one half to the time and depth contributions to the gradient, instead of the classical weight of one. Our empirical results confirm that pre-training both feed-forward and recurrent networks to fulfill the LSC, for differentiable, neuromorphic and state-space models, often results in improved final performance. This study contributes to the field by providing a means to stabilize networks of any complexity. Our approach can be implemented as an additional step before pre-training on large augmented datasets, and as an alternative to finding stable initializations analytically.
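The path-counting argument in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's derivation: it assumes every transition along depth or time contributes the same scalar factor to the gradient, and the function names `count_gradient_paths` and `weighted_path_sum` are hypothetical. Under that assumption, a weight of one lets the total grow with the number of paths through the grid, while a weight of one half keeps the total bounded by one.

```python
from math import comb


def count_gradient_paths(depth: int, time: int) -> int:
    """Count monotone paths through a depth x time grid.

    Each path takes `depth` steps along depth and `time` steps along time,
    in any order, so the count is the binomial coefficient C(depth + time, depth).
    """
    return comb(depth + time, depth)


def weighted_path_sum(depth: int, time: int, weight: float) -> float:
    """Sum of path contributions when every step multiplies by `weight`.

    Toy assumption: each transition contributes the same scalar factor,
    so every path contributes weight ** (depth + time).
    """
    return count_gradient_paths(depth, time) * weight ** (depth + time)


if __name__ == "__main__":
    for n in (2, 5, 10, 20):
        paths = count_gradient_paths(n, n)
        with_one = weighted_path_sum(n, n, 1.0)
        with_half = weighted_path_sum(n, n, 0.5)
        print(f"n={n:2d}  paths={paths}  weight 1: {with_one:.3e}  weight 1/2: {with_half:.3e}")
```

With weight one half the total equals C(depth + time, depth) / 2^(depth + time), which never exceeds one because the binomial coefficients of a given order sum to 2^(depth + time).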

On the Provable Generalization of Recurrent Neural Networks

Towards Interpreting Recurrent Neural Networks Through Probabilistic Abstraction

Residual Recurrent Neural Networks for Learning Sequential Representations

Understanding Generalization in Recurrent Neural Networks

Generalization and Risk Bounds for Recurrent Neural Networks

Linear RNNs Provably Learn Linear Dynamical Systems

Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks

Learning Low Dimensional State Spaces with Overparameterized Recurrent Neural Nets

Linear RNNs Provably Learn Linear Dynamic Systems

Making Neural Programming Architectures Generalize via Recursion

Algorithm-Dependent Generalization Bounds for Overparameterized Deep Residual Networks

Exploring the Long-Term Generalization of Counting Behavior in RNNs

A Generalization of Recurrent Neural Networks for Graph Embedding

Do highly over-parameterized neural networks generalize since bad solutions are rare?

Inverse Approximation Theory for Nonlinear Recurrent Neural Networks

Absolute Exponential Stability of Recurrent Neural Networks With Generalized Activation Function

Non-normal Recurrent Neural Network (nnRNN): learning long time dependencies while improving expressivity with transient dynamics

Stabilizing RNN Gradients through Pre-training

Dropout Training, Data-dependent Regularization, and Generalization Bounds

Batch Normalized Recurrent Neural Networks

Generalization Ability of Wide Neural Networks on $\mathbb{R}$