Limitations of neural network training due to numerical instability of backpropagation

Clemens Karner, Vladimir Kazeev, Philipp Christian Petersen
DOI: https://doi.org/10.1007/s10444-024-10106-x
2024-02-13
Advances in Computational Mathematics
Abstract: We study the training of deep neural networks by gradient descent where floating-point arithmetic is used to compute the gradients. In this framework and under realistic assumptions, we demonstrate that it is highly unlikely to find ReLU neural networks that maintain, in the course of training with gradient descent, superlinearly many affine pieces with respect to their number of layers. In virtually all approximation theoretical arguments which yield high order polynomial rates of approximation, sequences of ReLU neural networks with exponentially many affine pieces compared to their numbers of layers are used. As a consequence, we conclude that approximating sequences of ReLU neural networks resulting from gradient descent in practice differ substantially from theoretically constructed sequences. The assumptions and the theoretical results are compared to a numerical study, which yields concurring results.
mathematics, applied
What problem does this paper attempt to address?
This paper examines how the numerical instability of backpropagation limits neural network training. The authors argue that, when gradients are computed in floating-point arithmetic, gradient descent is highly unlikely to produce ReLU neural networks that maintain superlinearly many affine pieces relative to their number of layers. Networks with exponentially many affine pieces are what most approximation-theoretic arguments rely on to obtain high-order polynomial approximation rates. Through theoretical analysis and numerical experiments, the paper shows that the ReLU network sequences obtained by gradient descent in practice differ substantially from those constructed in theory: the theoretical constructions need exponentially many affine pieces, and rounding errors in floating-point computation make such networks unlikely to arise during training.
A key point is that training in floating-point arithmetic can be limited by "catastrophic cancellation". Even in small networks, finite precision can lead to relative errors as large as 1, which affects the accuracy of the network. The authors also hypothesize that small, controllable relative errors occurring in each layer during each iteration make it unlikely for the neural network to have exponentially many affine pieces.
The main contributions of the paper are as follows:
1. It provides a framework for analyzing the impact of floating-point computation on learning by gradient descent, and defines a gradient descent process with noisy updates.
2. It makes two assumptions: (A) the average inverse of the weight updates is bounded by a polynomial in the maximum number of neurons in a single layer, and dead neurons remain unchanged after iterations; (B) the derivatives of each neuron's input with respect to the network input are bounded by a constant.
3. It proves a theorem showing that, after a single perturbation, the expected number of linear pieces of the neural network has a polynomial upper bound that depends on the number of neurons, the number of layers, and the perturbation level.
4. It reports numerical experiments that support these theoretical results. The experiments indicate that numerical instability is one of the reasons networks with exponentially many linear pieces are not learned, but that other factors may also prevent trained networks from reaching the theoretical threshold.
In summary, the paper shows how numerical instability limits what neural network training can achieve in practice, and highlights the gap between theoretical constructions and networks obtained by training.
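The relative error of 1 attributed to catastrophic cancellation can be reproduced outside any neural network. The following is a minimal sketch in plain Python (float64), not the paper's experiment: it evaluates (1 + s) - 1 for increments s of different sizes relative to machine epsilon. The function names and the chosen increments are hypothetical and serve only as illustration.

```python
import sys


def cancelling_difference(big: float, small: float) -> float:
    """Evaluate (big + small) - big in float64; the exact answer is `small`."""
    return (big + small) - big


def relative_error(computed: float, exact: float) -> float:
    """Relative error |computed - exact| / |exact| for a nonzero exact value."""
    return abs(computed - exact) / abs(exact)


# Spacing between 1.0 and the next representable float64.
eps = sys.float_info.epsilon

# Hypothetical increments, chosen relative to eps for illustration only.
increments = {
    "well resolved (1024 * eps)": 1024 * eps,
    "barely resolved (1.3 * eps)": 1.3 * eps,
    "unresolved (eps / 4)": eps / 4,
}

errors = {}
for label, small in increments.items():
    computed = cancelling_difference(1.0, small)
    errors[label] = relative_error(computed, small)
    print(f"{label}: computed={computed:.3e}, relative error={errors[label]:.3f}")
```

In the last case the increment is smaller than half the spacing of floats near 1.0, so 1 + s rounds back to 1.0, the subtraction returns 0, and the relative error is exactly 1. The same mechanism can affect any quantity obtained by subtracting nearly equal numbers, which is the sense in which cancellation could corrupt a gradient update.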