Abstract: We study the training of deep neural networks by gradient descent where floating-point arithmetic is used to compute the gradients. In this framework and under realistic assumptions, we demonstrate that it is highly unlikely to find ReLU neural networks that maintain, in the course of training with gradient descent, superlinearly many affine pieces with respect to their number of layers. In virtually all approximation theoretical arguments which yield high order polynomial rates of approximation, sequences of ReLU neural networks with exponentially many affine pieces compared to their numbers of layers are used. As a consequence, we conclude that approximating sequences of ReLU neural networks resulting from gradient descent in practice differ substantially from theoretically constructed sequences. The assumptions and the theoretical results are compared to a numerical study, which yields concurring results.
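
To make "exponentially many affine pieces compared to the number of layers" concrete, here is a minimal sketch based on the standard sawtooth construction. It is an illustration chosen for this page, not code from the paper, and the names `hat`, `sawtooth`, and `count_affine_pieces` are hypothetical. Each layer uses two ReLU neurons, and the pieces are counted on a dyadic grid so that every breakpoint falls exactly on a grid point.

```python
def relu(z):
    return max(z, 0.0)


def hat(x):
    # One ReLU layer with two neurons: equals 2x on [0, 1/2] and 2 - 2x on [1/2, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def sawtooth(x, depth):
    # Compose the hat function `depth` times (a ReLU network with `depth` layers).
    for _ in range(depth):
        x = hat(x)
    return x


def count_affine_pieces(depth):
    # Dyadic grid on [0, 1], fine enough that every breakpoint is a grid point.
    n = 2 ** (depth + 2)
    xs = [i / n for i in range(n + 1)]
    ys = [sawtooth(x, depth) for x in xs]
    # Slope on each grid cell; a new affine piece starts wherever the slope changes.
    slopes = [(ys[i + 1] - ys[i]) * n for i in range(n)]
    return 1 + sum(1 for a, b in zip(slopes, slopes[1:]) if a != b)


for depth in range(1, 7):
    print(depth, count_affine_pieces(depth))
```

In this construction the number of pieces on [0, 1] doubles with every added layer, so a network with `depth` layers and only two neurons per layer has 2^depth affine pieces. Networks of this kind are what the abstract argues gradient descent is unlikely to maintain under floating-point arithmetic.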

What problem does this paper attempt to address?

This paper examines how the numerical instability of backpropagation limits neural network training. The authors argue that, when gradients are computed in floating-point arithmetic, gradient descent is highly unlikely to find ReLU neural networks that maintain superlinearly many affine pieces relative to their number of layers. This matters because the approximation-theoretic arguments that yield high-order polynomial approximation rates typically rely on networks with exponentially many affine pieces relative to their depth. Through theoretical analysis and numerical experiments, the paper shows that the ReLU network sequences obtained by gradient descent in practice differ substantially from the theoretically constructed ones, because rounding errors in floating-point computation make networks of the latter kind very unlikely to arise during training.

A key point is that floating-point computation exposes training to "catastrophic cancellation". Even in small networks, finite precision can lead to relative errors reaching 1, which affects the accuracy of the network. The authors also hypothesize that small, controllable relative errors occurring in each layer during each iteration make it unlikely for the trained network to have exponentially many affine pieces.

The main contributions of the paper are as follows:

1. It provides a framework for analyzing the impact of floating-point computations on gradient descent learning and defines a gradient descent process with noisy updates.
2. It makes two assumptions: (A) the average inverse of the weight updates is bounded by a polynomial in the maximum number of neurons in a single layer, and dead neurons remain unchanged after iterations; (B) the derivatives of each neuron's input with respect to the network input are bounded by a constant.
3. It proves a theorem showing that, after a single perturbation, the expected number of affine pieces of the network has a polynomial upper bound that depends on the number of neurons, the number of layers, and the perturbation level.
4. It reports a numerical study that supports these theoretical results. The study finds that numerical instability is one of the reasons why networks with exponentially many affine pieces are not learned, but that other factors may also prevent trained networks from reaching the theoretical threshold.

In summary, the paper shows how numerical instability limits what gradient descent can achieve in deep learning practice, and it highlights the gap between theoretical constructions and practically trained networks.
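
The catastrophic cancellation mentioned in the summary can be illustrated with a minimal sketch in plain double-precision Python. This is a generic floating-point example, not the paper's experiment, and the helper names `relative_error` and `cancellation_error` are hypothetical. Subtracting two nearly equal numbers can discard every significant digit of the result, so the relative error reaches 1.

```python
def relative_error(computed, exact):
    return abs(computed - exact) / abs(exact)


def cancellation_error(delta):
    # Mathematically (1 + delta) - 1 equals delta, but the sum 1 + delta is
    # rounded to the nearest double before the subtraction happens.
    computed = (1.0 + delta) - 1.0
    return relative_error(computed, delta)


# delta is below half the spacing of doubles near 1.0, so 1.0 + delta rounds
# to exactly 1.0 and the computed difference is 0.0.
print(cancellation_error(1e-16))

# A larger delta survives the rounding and the relative error stays small.
print(cancellation_error(1e-8))
```

With `delta = 1e-16` the computed difference is 0.0 while the exact value is 1e-16, so the relative error is 1. The summary's point is that errors of this size can appear in gradient computations even for small networks.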

Limitations of neural network training due to numerical instability of backpropagation

Plateau Phenomenon in Gradient Descent Training of ReLU networks: Explanation, Quantification and Avoidance

Numerical influence of ReLU'(0) on backpropagation

Topological obstruction to the training of shallow ReLU neural networks

Training a Two Layer ReLU Network Analytically

Convergence Analysis of Two-layer Neural Networks with ReLU Activation

The Disharmony between BN and ReLU Causes Gradient Explosion, but is Offset by the Correlation between Activations

Convergence proof for stochastic gradient descent in the training of deep neural networks with ReLU activation for constant target functions

Stable Minima Cannot Overfit in Univariate ReLU Networks: Generalization by Large Step Sizes

Equidistribution-based training of Free Knot Splines and ReLU Neural Networks

Compelling ReLU Networks to Exhibit Exponentially Many Linear Regions at Initialization and During Training

A PDE-based Explanation of Extreme Numerical Sensitivities and Edge of Stability in Training Neural Networks

Depth Degeneracy in Neural Networks: Vanishing Angles in Fully Connected ReLU Networks on Initialization

Neural networks with ReLU powers need less depth

Non-convergence to global minimizers in data driven supervised deep learning: Adam and stochastic gradient descent optimization provably fail to converge to global minimizers in the training of deep neural networks with ReLU activation

How many Neurons do we need? A refined Analysis for Shallow Networks trained with Gradient Descent

A proof of convergence for the gradient descent optimization method with random initializations in the training of neural networks with ReLU activation for piecewise linear target functions

Why ReLU Units Sometimes Die: Analysis of Single-Unit Error Backpropagation in Neural Networks

Gradient descent provably escapes saddle points in the training of shallow ReLU networks

Over-Parameterization Exponentially Slows Down Gradient Descent for Learning a Single Neuron