Abstract: When and why can a neural network be successfully trained? This article provides an overview of optimization algorithms and theory for training neural networks. First, we discuss the issue of gradient explosion/vanishing and the more general issue of undesirable spectrum, and then discuss practical solutions including careful initialization and normalization methods. Second, we review generic optimization methods used in training neural networks, such as SGD, adaptive gradient methods and distributed methods, and theoretical results for these algorithms. Third, we review existing research on the global issues of neural network training, including results on bad local minima, mode connectivity, lottery ticket hypothesis and infinite-width analysis.

What problem does this paper attempt to address?

The core question this paper addresses is: **When and why can neural networks be successfully trained?** Specifically, the article examines three groups of issues:

1. **Exploding/vanishing gradients and the broader problem of an ill-conditioned spectrum**:
   - When training deep neural networks, gradients can grow rapidly (explode) or shrink rapidly (vanish) as the network depth increases, which makes training difficult. For example, consider the simple one-dimensional problem \[ F(w)=0.5 (w_1 w_2 \cdots w_L - 1)^2 \] and let \( e = w_1 w_2 \cdots w_L - 1 \) denote the error. The partial derivative with respect to \( w_i \) is \( e \prod_{j \neq i} w_j \). If all \( w_j = 2 \), its magnitude is \( 2^{L - 1}|e| \), which is very large; if all \( w_j=\frac{1}{2} \), its magnitude is \( 0.5^{L - 1}|e| \), which is very small.
   - Gradients that are this large or this small disrupt the weight updates, which affects both the convergence speed and the final result of training.
2. **The choice of optimization algorithms**:
   - The article reviews general-purpose optimization methods for training neural networks, such as stochastic gradient descent (SGD), adaptive gradient methods, and distributed training methods, and discusses the existing theoretical results for these algorithms. For example, SGD is a commonly used optimization method, and its basic form is \[ \theta_{t + 1}=\theta_t-\eta_t\nabla F_{i_t}(\theta_t) \] where \(\eta_t\) is the learning rate and \(\nabla F_{i_t}(\theta_t)\) is the gradient of the loss on a randomly sampled example or mini-batch \(i_t\). Replacing it with the gradient of the full loss, \(\nabla F(\theta_t)\), gives ordinary gradient descent.
3. **The global optimization problem**:
   - The article also reviews research on global issues in neural network training, including bad local minima, mode connectivity, the lottery ticket hypothesis, and infinite-width analysis. These topics concern the global optimum and the shape of the optimization landscape.

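
The scaling in the one-dimensional example can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `product_loss_gradient` and `partial_magnitude` are hypothetical helpers introduced here for illustration and do not come from the paper.

```python
import math


def product_loss_gradient(w):
    """Return (F(w), gradient) for F(w) = 0.5 * (w_1 * ... * w_L - 1)^2."""
    e = math.prod(w) - 1.0  # error term e = w_1 * ... * w_L - 1
    loss = 0.5 * e * e
    # dF/dw_j = e * (product of all weights except w_j)
    grad = [e * math.prod(w[:j] + w[j + 1:]) for j in range(len(w))]
    return loss, grad


def partial_magnitude(value, depth):
    """Magnitude of dF/dw_1 when all `depth` weights equal `value`."""
    _, grad = product_loss_gradient([value] * depth)
    return abs(grad[0])


if __name__ == "__main__":
    for depth in (5, 10, 20):
        large = partial_magnitude(2.0, depth)
        small = partial_magnitude(0.5, depth)
        print(f"L={depth:2d}  all w_j=2: {large:.3e}   all w_j=0.5: {small:.3e}")
```

With all weights equal, every partial derivative has the same magnitude, so the first entry is representative. Under these assumptions the printed values should grow roughly like \( 4^{L} \) for \( w_j = 2 \) (both \( |e| \) and the remaining product grow with depth) and shrink roughly like \( 0.5^{L-1} \) for \( w_j = \frac{1}{2} \).
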
To address these problems, the article discusses several practical remedies:

- **Initialization strategies**: such as LeCun initialization and Xavier initialization, which scale the initial weights to a suitable range so that gradients neither explode nor vanish.
- **Normalization methods**: such as batch normalization, which help stabilize and accelerate training.
- **Skip connections**: such as those in ResNet, which alleviate the vanishing gradient problem in deep networks.

By surveying these methods, the article aims to provide both theoretical understanding and practical guidance for neural network optimization, helping researchers understand and address the challenges of training neural networks.
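
To illustrate why the scale of the initial weights matters, here is a minimal sketch that pushes a random vector through a stack of purely linear layers and reports how much its norm changes. It rests on several simplifying assumptions that are not from the paper: square layers, no nonlinearity, Gaussian weights, and a hypothetical helper named `forward_norm_ratio`. For square layers, the LeCun variance \( 1/\text{fan\_in} \) and the Xavier variance \( 2/(\text{fan\_in} + \text{fan\_out}) \) coincide.

```python
import math
import random


def forward_norm_ratio(width, depth, weight_std, seed=0):
    """Push a random vector through `depth` linear layers.

    Each layer is a fresh width-by-width matrix with i.i.d. N(0, weight_std^2)
    entries. Returns ||output|| / ||input||.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    input_norm = math.sqrt(sum(v * v for v in x))
    for _ in range(depth):
        # One output coordinate per row of the random weight matrix.
        x = [sum(rng.gauss(0.0, weight_std) * v for v in x) for _ in range(width)]
    return math.sqrt(sum(v * v for v in x)) / input_norm


width, depth = 64, 20
lecun_std = math.sqrt(1.0 / width)  # LeCun scaling: Var(W_ij) = 1 / fan_in

ratios = {
    "lecun": forward_norm_ratio(width, depth, lecun_std),
    "std 2x too large": forward_norm_ratio(width, depth, 2.0 * lecun_std),
    "std 2x too small": forward_norm_ratio(width, depth, 0.5 * lecun_std),
}

if __name__ == "__main__":
    for name, ratio in ratios.items():
        print(f"{name:>18}: ||output|| / ||input|| = {ratio:.3e}")
```

In this simplified setting, the LeCun-scaled network should keep the norm at roughly the same order of magnitude, while doubling or halving the standard deviation changes the norm by about a factor of \( 2^{20} \) over 20 layers. The same multiplicative effect applies to backpropagated gradients, which is the motivation for variance-preserving initialization.
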

Optimization for deep learning: theory and algorithms

Enhancing Deep Learning with Optimized Gradient Descent: Bridging Numerical Methods and Neural Network Training

Optimization Methods in Deep Learning: A Comprehensive Overview

Theoretical Issues in Deep Networks: Approximation, Optimization and Generalization

Effective Neural Network Training with a New Weighting Mechanism-Based Optimization Algorithm

A Comparison of Optimization Algorithms for Deep Learning

Artificial Neural Network and Deep Learning: Fundamentals and Theory

An Efficient Optimization Technique for Training Deep Neural Networks

Understanding Optimization of Deep Learning via Jacobian Matrix and Lipschitz Constant

Understanding the Role of Optimization in Double Descent

Gradient Descent, Stochastic Optimization, and Other Tales

An Essay on Optimization Mystery of Deep Learning

Optimization Algorithm Inspired Deep Neural Network Structure Design

Empirical Tests of Optimization Assumptions in Deep Learning

Optimization of deep learning models: benchmark and analysis

Gradient Descent Optimization in Deep Learning Model Training Based on Multistage and Method Combination Strategy

Computational issues in Optimization for Deep networks

Beyond Single-Model Views for Deep Learning: Optimization versus Generalizability of Stochastic Optimization Algorithms

A Comprehensive Study on Optimization Strategies for Gradient Descent In Deep Learning

Research on Optimization of Image Recognition Algorithm Based on Deep Learning

Old Optimizer, New Norm: An Anthology