A survey of deep learning optimizers -- first and second order methods

Rohan Kashyap
2023-09-27
Abstract: Deep Learning optimization involves minimizing a high-dimensional loss function in the weight space which is often perceived as difficult due to its inherent difficulties such as saddle points, local minima, ill-conditioning of the Hessian and limited compute resources. In this paper, we provide a comprehensive review of $14$ standard optimization methods successfully used in deep learning research and a theoretical assessment of the difficulties in numerical optimization from the optimization literature.
Machine Learning, Computer Vision and Pattern Recognition, Optimization and Control
What problem does this paper attempt to address?
This paper addresses the challenges of deep-learning optimization, in particular how to minimize high-dimensional loss functions effectively. Specifically, the author reviews and evaluates 14 standard optimization methods that have been successfully applied in deep-learning research, and theoretically assesses the difficulties of numerical optimization from the perspective of the optimization literature.

### Main problems include:

1. **Saddle points and local minima**: The loss functions in deep learning are usually highly non-convex, with a large number of saddle points and local minima, which complicates optimization and makes it prone to getting stuck in sub-optimal solutions.
2. **Ill-conditioned Hessian matrix**: The Hessian matrix may have a poor condition number, which makes gradient updates unstable or too slow during optimization.
3. **Limited computing resources**: Since deep-learning models usually have a large number of parameters, optimization requires efficient algorithms that keep the consumption of computing resources low.

### Main objectives of the paper:

- Provide a comprehensive review covering first-order and second-order optimization methods.
- Evaluate the effectiveness and limitations of various optimization methods in dealing with the above-mentioned problems.
- Provide guidance for researchers selecting an optimization method, especially when facing different types of non-convex optimization problems.

### Overview of specific content:

- **First-order methods**: Such as Stochastic Gradient Descent (SGD), the momentum method, Nesterov momentum, AdaGrad, RMSProp and Adam. These methods rely mainly on gradient information for parameter updates.
- **Second-order methods**: Such as Newton's method and quasi-Newton methods (BFGS). These methods use information from the Hessian matrix to accelerate convergence, but their computational cost is high.

By comparing these methods, the paper aims to help readers understand the advantages and disadvantages of each method and provide valuable references for practical applications.
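To make the contrast between first-order and second-order updates concrete, here is a minimal NumPy sketch. It is not taken from the paper: the quadratic test problem, the matrix `A`, the starting point `w0`, the function names and all hyperparameters are assumptions chosen for illustration. The gradient is exact, so the minibatch noise of real SGD is left out.

```python
import numpy as np

# Hypothetical test problem (not from the paper): f(w) = 0.5 * w^T A w.
# The Hessian is A itself, with condition number 100 (ill-conditioned).
A = np.diag([1.0, 100.0])

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

def gradient_descent(w, lr=0.015, steps=100):
    # First-order: step along the negative (exact) gradient.
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def momentum(w, lr=0.015, beta=0.9, steps=100):
    # First-order with a velocity term that accumulates past gradients.
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad(w)
        w = w - lr * v
    return w

def adam(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    # First-order with per-coordinate step sizes from gradient moment estimates.
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero init
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

def newton_step(w):
    # Second-order: solve H d = g instead of forming the inverse Hessian.
    return w - np.linalg.solve(A, grad(w))

w0 = np.array([1.0, 1.0])
w_gd = gradient_descent(w0)
w_mom = momentum(w0)
w_adam = adam(w0)
w_newton = newton_step(w0)

for name, w in [("gradient descent", w_gd), ("momentum", w_mom),
                ("adam", w_adam), ("newton (1 step)", w_newton)]:
    print(f"{name:>16}: loss = {loss(w):.3e}")
```

On this toy quadratic, gradient descent is only stable for `lr < 2/100` because of the largest Hessian eigenvalue, so progress along the flat direction is slow. The single Newton step reaches the minimizer because the Hessian is constant. That comes at the cost of a linear solve, which scales poorly with the number of parameters and is the computational expense the summary refers to.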