Abstract: Deep learning optimization involves minimizing a high-dimensional loss function in the weight space, which is often perceived as difficult because of saddle points, local minima, ill-conditioning of the Hessian, and limited compute resources. In this paper, we provide a comprehensive review of $14$ standard optimization methods successfully used in deep learning research and a theoretical assessment of the difficulties in numerical optimization from the optimization literature.

What problem does this paper attempt to address?

The paper addresses the challenges of deep learning optimization, in particular how to minimize high-dimensional loss functions effectively. The authors review 14 standard optimization methods that have been applied successfully in deep learning research, and they assess the difficulties of numerical optimization from the perspective of the optimization literature.

### Main problems

1. **Saddle points and local minima**: Loss functions in deep learning are usually highly non-convex, with many saddle points and local minima, so the optimizer can stall or settle on a suboptimal solution.
2. **Ill-conditioned Hessian matrix**: When the Hessian has a large condition number, a step size small enough to be stable along high-curvature directions makes progress along low-curvature directions very slow, and a larger step size makes the updates unstable.
3. **Limited computing resources**: Deep learning models usually have a large number of parameters, so the optimization algorithm itself must be cheap in both computation and memory.

### Main objectives of the paper

- Provide a comprehensive review covering first-order and second-order optimization methods.
- Evaluate the effectiveness and limitations of these methods in dealing with the problems listed above.
- Provide guidance for researchers selecting an optimization method, especially across different types of non-convex optimization problems.

### Overview of specific content

- **First-order methods**: Stochastic Gradient Descent (SGD), the momentum method, Nesterov momentum, AdaGrad, RMSProp, Adam, and others. These methods rely mainly on gradient information for parameter updates.
- **Second-order methods**: Newton's method, quasi-Newton methods such as BFGS, and others. These methods use Hessian information to accelerate convergence, but at a high computational cost.

By comparing these methods, the paper aims to help readers understand the advantages and disadvantages of each one and to provide a useful reference for practical applications.
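The contrast between first-order and second-order updates can be sketched on a toy problem. The example below is an illustration only and is not taken from the paper: the quadratic loss, its curvatures, the hyperparameters, and all function names are assumptions chosen to keep the sketch small. The gradient is exact, so the SGD update reduces to plain gradient descent, and the Hessian is diagonal, so the Newton step can be computed elementwise.

```python
import math

# Hypothetical toy problem (an assumption, not from the paper):
#   f(w) = 0.5 * (1 * w[0]**2 + 100 * w[1]**2)
# The Hessian is diag(1, 100), so its condition number is 100.
CURVATURES = (1.0, 100.0)


def loss(w):
    return 0.5 * sum(c * x * x for c, x in zip(CURVATURES, w))


def grad(w):
    return [c * x for c, x in zip(CURVATURES, w)]


def gradient_descent(w, steps, lr=0.015):
    """Plain gradient descent; SGD reduces to this when the gradient has no noise."""
    w = list(w)
    for _ in range(steps):
        w = [x - lr * g for x, g in zip(w, grad(w))]
    return w


def momentum(w, steps, lr=0.015, beta=0.9):
    """Heavy-ball momentum: accumulate a velocity, then step along it."""
    w = list(w)
    v = [0.0] * len(w)
    for _ in range(steps):
        v = [beta * vi + g for vi, g in zip(v, grad(w))]
        w = [x - lr * vi for x, vi in zip(w, v)]
    return w


def adam(w, steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: a running mean of the gradient with per-coordinate step scaling."""
    w = list(w)
    m = [0.0] * len(w)  # first-moment estimate
    s = [0.0] * len(w)  # second-moment estimate
    for t in range(1, steps + 1):
        g = grad(w)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        s = [beta2 * si + (1 - beta2) * gi * gi for si, gi in zip(s, g)]
        m_hat = [mi / (1 - beta1 ** t) for mi in m]  # bias correction
        s_hat = [si / (1 - beta2 ** t) for si in s]
        w = [x - lr * mh / (math.sqrt(sh) + eps)
             for x, mh, sh in zip(w, m_hat, s_hat)]
    return w


def newton(w, steps=1):
    """Newton's method; the Hessian is diagonal here, so H^-1 g is elementwise."""
    w = list(w)
    for _ in range(steps):
        w = [x - g / c for x, g, c in zip(w, grad(w), CURVATURES)]
    return w


if __name__ == "__main__":
    start = [1.0, 1.0]
    print("start            ", loss(start))
    print("gradient descent ", loss(gradient_descent(start, 100)))
    print("momentum         ", loss(momentum(start, 100)))
    print("adam             ", loss(adam(start, 100)))
    print("newton (1 step)  ", loss(newton(start)))
```

On this problem the gradient descent step size is capped by the steepest direction (curvature 100), so the flat direction (curvature 1) shrinks by only 1.5% per step, while a single Newton step lands on the minimizer because the loss is exactly quadratic. That advantage does not carry over directly to deep networks, where the Hessian is dense and has one row and one column per parameter, which is the computational cost the summary refers to.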

A survey of deep learning optimizers -- first and second order methods
