Abstract: The value of second-order methods lies in the use of curvature information. Yet, this information is costly to extract and once obtained, valuable negative curvature information is often discarded so that the method is globally convergent. This limits the effectiveness of second-order methods in modern machine learning. In this paper, we show that second-order and second-order-like methods are promising optimizers for neural networks provided that we add one ingredient: negative step sizes. We show that under very general conditions, methods that produce ascent directions are globally convergent when combined with a Wolfe line search that allows both positive and negative step sizes. We experimentally demonstrate that using negative step sizes is often more effective than common Hessian modification methods.

What problem does this paper attempt to address?

The problem this paper attempts to solve is the following: in modern machine learning, especially in the training of neural networks, the effectiveness of second-order optimization methods is limited by how they handle non-convex problems. Specifically:

1. **Limitations of second-order methods in non-convex optimization**:
   - Second-order methods rely on curvature information, but this information is costly to extract, and once obtained, standard methods usually discard valuable negative curvature information to ensure global convergence. This limits the effectiveness of second-order methods in modern machine learning.
   - When training neural networks, the objective function is non-convex, so second-order methods are prone to getting stuck at saddle points and local maxima, and are therefore considered unsuitable for such tasks.
2. **Deficiencies of existing solutions**:
   - Existing remedies such as Hessian modification, trust-region methods, and cubic regularization can ensure global convergence, but they often require expensive computation and may lose accurate curvature information.
   - First-order methods (such as gradient descent and its variants) can converge, but are slow because they do not fully use the curvature information of the loss function.
3. **The proposed new method**:
   - This paper proposes using negative step sizes to exploit negative curvature information. Specifically, when the search direction is an ascent direction, allowing a negative step size lets the optimizer move along the opposite direction and still decrease the objective, thereby improving optimization performance.
   - The paper proves that, under very general conditions, methods that produce ascent directions are globally convergent in a non-convex setting when combined with a Wolfe line search that allows both positive and negative step sizes.

In summary, the main contribution of this paper lies in exploring the role of negative step sizes in optimization, proposing a simple and effective way to use negative curvature information, and thus improving the performance of second-order methods in neural network training. Experimental results show that quasi-Newton methods combined with negative step sizes, such as SR1, outperform traditional methods on multiple datasets.
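
The negative-step idea summarized above can be illustrated with a small sketch. It is not the paper's algorithm: it uses a plain backtracking search with only the sufficient-decrease (Armijo) condition, whereas the paper analyzes a full Wolfe line search. The helper name `signed_armijo_step`, its parameters, and the 1-D test function are all assumptions chosen for illustration.

```python
def signed_armijo_step(f, x, g, d, c1=1e-4, shrink=0.5, max_halvings=30):
    """Backtracking search over step sizes t of either sign.

    Hypothetical helper (not from the paper): x, g, d are lists of floats
    holding the current point, the gradient, and the search direction.
    Returns t with f(x + t*d) <= f(x) + c1 * t * (g . d), or 0.0 on failure.
    """
    slope = sum(gi * di for gi, di in zip(g, d))  # directional derivative g . d
    if slope == 0.0:
        return 0.0  # d is orthogonal to the gradient: no first-order progress
    # Ascent direction (slope > 0): search negative t. Descent: search positive t.
    t = -1.0 if slope > 0 else 1.0
    fx = f(x)
    for _ in range(max_halvings):
        trial = [xi + t * di for xi, di in zip(x, d)]
        if f(trial) <= fx + c1 * t * slope:  # t * slope < 0 in both cases
            return t
        t *= shrink  # shrink the magnitude, keep the sign
    return 0.0


# Assumed 1-D non-convex test function: local maximum at 0, minima at +/- sqrt(2).
def f(x):
    return 0.25 * x[0] ** 4 - x[0] ** 2


def grad(x):
    return [x[0] ** 3 - 2.0 * x[0]]


def hess(x):
    return 3.0 * x[0] ** 2 - 2.0


x = [0.5]
g = grad(x)
d = [-g[0] / hess(x)]  # Newton direction; the second derivative is negative here
t = signed_armijo_step(f, x, g, d)
x_new = [x[0] + t * d[0]]
```

At `x = 0.5` the second derivative is negative, so the unmodified Newton direction points toward the local maximum at 0 and is an ascent direction. The search therefore returns a negative `t`, which moves the iterate away from the maximum and lowers `f` without modifying the curvature information.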

Don't Be So Positive: Negative Step Sizes in Second-Order Methods

Exploiting Negative Curvature in Conjunction with Adaptive Sampling: Theoretical Results and a Practical Algorithm

Yet another fast variant of Newton’s method for nonconvex optimization

An Adaptive Stochastic Gradient Method with Non-negative Gauss-Newton Stepsizes

Saving Gradient and Negative Curvature Computations: Finding Local Minima More Efficiently

Second-order step-size tuning of SGD for non-convex optimization

Local Curvature Descent: Squeezing More Curvature out of Standard and Polyak Gradient Descent

Second-order Neural Network Training Using Complex-step Directional Derivative

Provably Faster Gradient Descent via Long Steps

Inexact Newton-type Methods for Optimisation with Nonnegativity Constraints

Large Stepsize Gradient Descent for Non-Homogeneous Two-Layer Networks: Margin Improvement and Fast Optimization

Generalized Optimistic Methods for Convex-Concave Saddle Point Problems

A Subsampling Line-Search Method with Second-Order Results

Detecting negative eigenvalues of exact and approximate Hessian matrices in optimization

The loss landscape of deep linear neural networks: a second-order analysis

Adaptive Coordinate-Wise Step Sizes for Quasi-Newton Methods: A Learning-to-Optimize Approach

Gradient descent with adaptive stepsize converges (nearly) linearly under fourth-order growth

Gradient Methods with Adaptive Step-Sizes

Two efficient gradient methods with approximately optimal stepsizes based on regularization models for unconstrained optimization

Accelerated Objective Gap and Gradient Norm Convergence for Gradient Descent via Long Steps