Abstract: The ability of neural networks to provide 'best in class' approximation across a wide range of applications is well-documented. Nevertheless, the powerful expressivity of neural networks comes to naught if one is unable to effectively train (choose) the parameters defining the network. In general, neural networks are trained by gradient descent type optimization methods, or a stochastic variant thereof. In practice, such methods cause the loss function to decrease rapidly at the beginning of training but then, after a relatively small number of steps, to slow down significantly. The loss may even appear to stagnate over a large number of epochs, only to then suddenly start to decrease quickly again for no apparent reason. This so-called plateau phenomenon manifests itself in many learning tasks. The present work aims to identify and quantify the root causes of the plateau phenomenon. No assumptions are made on the number of neurons relative to the number of training data, and our results hold for both the lazy and adaptive regimes. The main findings are: plateaux correspond to periods during which activation patterns remain constant, where activation pattern refers to the number of data points that activate a given neuron; quantification of convergence of the gradient flow dynamics; and characterization of stationary points in terms of solutions of local least squares regression lines over subsets of the training data. Based on these conclusions, we propose a new iterative training method, the Active Neuron Least Squares (ANLS), characterised by the explicit adjustment of the activation pattern at each step, which is designed to enable a quick exit from a plateau. Illustrative numerical examples are included throughout.
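The notion of activation pattern used in the abstract can be illustrated with a minimal sketch. It assumes a shallow ReLU network whose j-th neuron computes relu(w_j . x + b_j), and it treats a data point as activating a neuron when the pre-activation is strictly positive. The function names `activation_counts` and `pattern_unchanged`, and the toy data, are hypothetical and not taken from the paper; this is not the ANLS method itself, only a way to monitor whether the pattern changes between two training steps.

```python
import numpy as np

def activation_counts(X, W, b):
    """Count, for each neuron, how many data points activate it.

    X: (n_samples, d) data, W: (n_neurons, d) weights, b: (n_neurons,) biases.
    Assumed convention: x activates neuron j when w_j . x + b_j > 0.
    """
    pre = X @ W.T + b              # pre-activations, shape (n_samples, n_neurons)
    return (pre > 0).sum(axis=0)   # one count per neuron

def pattern_unchanged(X, params_before, params_after):
    """True if the per-neuron activation counts agree for two parameter sets."""
    return np.array_equal(activation_counts(X, *params_before),
                          activation_counts(X, *params_after))

# Toy example (hypothetical data): four 1-D points, two neurons.
X = np.array([[-1.0], [0.0], [1.0], [2.0]])
W = np.array([[1.0], [-1.0]])
b = np.array([0.0, -0.5])
print(activation_counts(X, W, b))  # [2 1]
```

Under the abstract's description, a stretch of training steps over which `pattern_unchanged` keeps returning True would be a candidate plateau; a change in the counts marks a step at which some neuron gains or loses an active data point.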

Trainability of ReLU networks and Data-dependent Initialization

Compelling ReLU Networks to Exhibit Exponentially Many Linear Regions at Initialization and During Training

Why ReLU Units Sometimes Die: Analysis of Single-Unit Error Backpropagation in Neural Networks

Initialization Matters: On the Benign Overfitting of Two-Layer ReLU CNN with Fully Trainable Layers

Plateau Phenomenon in Gradient Descent Training of ReLU networks: Explanation, Quantification and Avoidance

How to Initialize your Network? Robust Initialization for WeightNorm & ResNets

Luck Matters: Understanding Training Dynamics of Deep ReLU Networks

Phase Diagram for Two-layer ReLU Neural Networks at Infinite-width Limit

The effect of Target Normalization and Momentum on Dying ReLU

Stably unactivated neurons in ReLU neural networks

Early Neuron Alignment in Two-layer ReLU Networks with Small Initialization

Neural networks with ReLU powers need less depth

Benign Overfitting for Regression with Trained Two-Layer ReLU Networks

Improving performance of recurrent neural network with relu nonlinearity

ReLUs Are Sufficient for Learning Implicit Neural Representations

A theoretical framework for deep locally connected ReLU network

Principles for Initialization and Architecture Selection in Graph Neural Networks with ReLU Activations

Improved weight initialization for deep and narrow feedforward neural network

Understanding Multi-phase Optimization Dynamics and Rich Nonlinear Behaviors of ReLU Networks

Empirical Phase Diagram for Three-layer Neural Networks with Infinite Width

The Computational Complexity of ReLU Network Training Parameterized by Data Dimensionality