Don't Be So Positive: Negative Step Sizes in Second-Order Methods

Betty Shea, Mark Schmidt
2024-12-06
Abstract: The value of second-order methods lies in the use of curvature information. Yet, this information is costly to extract and once obtained, valuable negative curvature information is often discarded so that the method is globally convergent. This limits the effectiveness of second-order methods in modern machine learning. In this paper, we show that second-order and second-order-like methods are promising optimizers for neural networks provided that we add one ingredient: negative step sizes. We show that under very general conditions, methods that produce ascent directions are globally convergent when combined with a Wolfe line search that allows both positive and negative step sizes. We experimentally demonstrate that using negative step sizes is often more effective than common Hessian modification methods.
Machine Learning, Optimization and Control
What problem does this paper attempt to address?
This paper addresses the following problem: in modern machine learning, and especially in neural network training, the effectiveness of second-order optimization methods is limited by how they handle non-convex problems. Specifically:

1. **Limitations of second-order methods in non-convex optimization**:
   - Second-order methods rely on curvature information, but this information is costly to extract, and once obtained, standard methods usually discard valuable negative curvature information to ensure global convergence. This limits the effectiveness of second-order methods in modern machine learning.
   - When training neural networks, the objective function is non-convex, so second-order methods are prone to getting stuck at saddle points and local maxima, and are therefore considered unsuitable for such tasks.
2. **Deficiencies of existing solutions**:
   - Existing remedies such as Hessian modification, trust-region methods, and cubic regularization can ensure global convergence, but they are often computationally expensive and may lose accurate curvature information.
   - First-order methods (such as gradient descent and its variants) can converge, but are slow because they do not fully use the curvature information of the loss function.
3. **The proposed new method**:
   - The paper proposes using negative step sizes to exploit negative curvature information. Specifically, when the search direction is an ascent direction, allowing a negative step size lets the optimizer move backward along that direction and still decrease the objective, rather than discarding or modifying the direction.
   - The paper proves that, under very general conditions, a method combined with a Wolfe line search that allows both positive and negative step sizes is globally convergent in the non-convex setting.
In summary, the main contribution of this paper lies in exploring the role of negative step sizes in optimization, proposing a simple and effective method to utilize negative curvature information, and thus improving the performance of second-order methods in neural network training. Experimental results show that quasi-Newton methods such as SR1, combined with negative step sizes, outperform traditional methods on multiple datasets.
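The negative step size idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: it swaps the Wolfe line search for a simpler Armijo-style backtracking search that chooses the sign of the step from the directional derivative, and it uses exact Newton directions on a hypothetical two-dimensional test function with a saddle point at the origin. The function names (`signed_backtracking`, `newton_with_signed_steps`) and the constants `c1` and `shrink` are assumptions made for illustration.

```python
import numpy as np

def f(x):
    # Hypothetical non-convex test function: saddle at (0, 0),
    # minima at (0, +sqrt(2)) and (0, -sqrt(2)) with value -1.
    return x[0] ** 2 - x[1] ** 2 + 0.25 * x[1] ** 4

def grad(x):
    return np.array([2.0 * x[0], -2.0 * x[1] + x[1] ** 3])

def hess(x):
    return np.array([[2.0, 0.0], [0.0, -2.0 + 3.0 * x[1] ** 2]])

def signed_backtracking(f, x, g, d, c1=1e-4, shrink=0.5, max_iter=50):
    """Armijo backtracking that allows negative step sizes (a simplification;
    the paper uses a Wolfe line search)."""
    slope = g @ d  # directional derivative of f along d
    if slope == 0.0:
        return 0.0
    # Ascent direction (slope > 0): start from a negative step instead of
    # discarding or modifying the direction.
    alpha = -1.0 if slope > 0 else 1.0
    fx = f(x)
    for _ in range(max_iter):
        # alpha * slope is negative for both signs, so this is a
        # sufficient-decrease test in either case.
        if f(x + alpha * d) <= fx + c1 * alpha * slope:
            return alpha
        alpha *= shrink
    return 0.0

def newton_with_signed_steps(x0, iters=50, tol=1e-8):
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Unmodified Newton direction; may be an ascent direction
        # when the Hessian has negative eigenvalues.
        d = -np.linalg.solve(hess(x), g)
        x = x + signed_backtracking(f, x, g, d) * d
    return x

x_final = newton_with_signed_steps([1.0, 0.5])
```

In this example the Hessian is indefinite near the starting point, so some Newton directions point uphill and the search takes negative steps along them. With a fixed step of +1, the same directions would pull the iterate toward the saddle at the origin.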