Abstract: Despite an extensive body of literature on deep learning optimization, our current understanding of what makes an optimization algorithm effective is fragmented. In particular, we do not understand well whether enhanced optimization translates to improved generalizability. Current research overlooks the inherent stochastic nature of stochastic gradient descent (SGD) and its variants, resulting in a lack of comprehensive benchmarking and insight into their statistical performance. This paper aims to address this gap by adopting a novel approach. Rather than solely evaluating the endpoint of individual optimization trajectories, we draw from an ensemble of trajectories to estimate the stationary distribution of stochastic optimizers. Our investigation encompasses a wide array of techniques, including SGD and its variants, flat-minima optimizers, and new algorithms we propose under the Basin Hopping framework. Through our evaluation, which encompasses synthetic functions with known minima and real-world problems in computer vision and natural language processing, we emphasize fair benchmarking under a statistical framework, comparing stationary distributions and establishing statistical significance. Our study uncovers several key findings regarding the relationship between training loss and hold-out accuracy, as well as the comparable performance of SGD, noise-enabled variants, and novel optimizers utilizing the BH framework. Notably, these algorithms demonstrate performance on par with flat-minima optimizers like SAM, albeit with half the gradient evaluations. We anticipate that our work will catalyze further exploration in deep learning optimization, encouraging a shift away from single-model approaches towards methodologies that acknowledge and leverage the stochastic nature of optimizers.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper aims to address the following issues:

1. **Effectiveness of Optimization Algorithms**: The current understanding of what makes an optimization algorithm effective is fragmented. In particular, it is unclear whether better optimization translates into better generalizability.
2. **Intrinsic Stochastic Nature of Stochastic Gradient Descent (SGD) and Its Variants**: Existing research has overlooked the stochastic character of SGD and its variants, which has led to a lack of comprehensive benchmarking and limited insight into their statistical performance.

The paper seeks to fill this gap with a different evaluation approach. Instead of evaluating only the endpoint of a single optimization trajectory, the authors draw from an ensemble of trajectories to estimate the stationary distribution of a stochastic optimizer. The study covers a wide range of optimization techniques, including SGD and its variants, flat-minima optimizers, and new algorithms proposed under the Basin Hopping (BH) framework. The evaluation spans synthetic functions with known minima and real-world problems in computer vision and natural language processing. Throughout, the paper emphasizes fair benchmarking under a statistical framework: it compares the stationary distributions of different optimizers and establishes statistical significance.

The main findings concern the relationship between training loss and hold-out accuracy, and the comparable performance of SGD, its noise-enabled variants, and the newly proposed BH-based optimizers. These algorithms perform on par with flat-minima optimizers such as SAM while requiring only half the gradient evaluations. The authors hope that this work will promote further exploration in deep learning optimization and encourage a shift from single-model approaches to methods that acknowledge and exploit the stochasticity of optimizers.
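The idea of comparing optimizers by an ensemble of trajectories rather than by a single run can be illustrated with a small sketch. This is not the authors' code or protocol. The 1-D test function, the Gaussian gradient noise standing in for minibatch noise, the permutation test, and all names (`loss`, `noisy_gd`, `ensemble`, `permutation_test`) are assumptions chosen for illustration only.

```python
import random
import statistics


def loss(x):
    # Hypothetical 1-D function with two minima of different depth (not from the paper).
    return (x ** 2 - 1.0) ** 2 + 0.3 * x


def grad(x):
    # Analytic derivative of loss(x).
    return 4.0 * x * (x ** 2 - 1.0) + 0.3


def noisy_gd(seed, lr=0.05, noise=0.5, steps=500):
    """Run one noisy gradient-descent trajectory and return its final loss."""
    rng = random.Random(seed)
    x = rng.uniform(-2.0, 2.0)  # random initialization
    for _ in range(steps):
        # Assumption: additive Gaussian noise stands in for minibatch noise.
        g = grad(x) + rng.gauss(0.0, noise)
        x -= lr * g
    return loss(x)


def ensemble(n_runs, **kwargs):
    """Final losses of n_runs independent trajectories (seeds 0..n_runs-1)."""
    return [noisy_gd(seed, **kwargs) for seed in range(n_runs)]


def permutation_test(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of sample means."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:len(a)]) - statistics.mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


# Two "optimizers" that differ only in their noise level.
low = ensemble(50, noise=0.1)
high = ensemble(50, noise=2.0)
p_value = permutation_test(low, high)
print(statistics.mean(low), statistics.mean(high), p_value)
```

Each list of final losses is an empirical sample from the end-of-training distribution of one configuration, so the comparison is between two distributions rather than two individual runs. The permutation test is only one possible choice of significance test; the summary above does not specify which test the paper uses.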

Beyond Single-Model Views for Deep Learning: Optimization versus Generalizability of Stochastic Optimization Algorithms

Enhancing Deep Learning with Optimized Gradient Descent: Bridging Numerical Methods and Neural Network Training

Gradient Descent, Stochastic Optimization, and Other Tales

A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning

Understanding the Role of Optimization in Double Descent

Learning Gradient Descent: Better Generalization and Longer Horizons

Multiplicative noise and heavy tails in stochastic optimization

Stochastic Gradient Descent Introduces an Effective Landscape-Dependent Regularization Favoring Flat Solutions

Stability and Generalization for Minibatch SGD and Local SGD

Optimal Adaptive and Accelerated Stochastic Gradient Descent

Stochastic Gradient Descent with Biased but Consistent Gradient Estimators

No Wrong Turns: The Simple Geometry Of Neural Networks Optimization Paths

Understanding Stochastic Optimization Behavior at the Layer Update Level (Student Abstract)

A Dynamical View on Optimization Algorithms of Overparameterized Neural Networks

Beyond Convexity: Stochastic Quasi-Convex Optimization

Optimization Methods in Deep Learning: A Comprehensive Overview

Empirical Tests of Optimization Assumptions in Deep Learning

Learning Non-Vacuous Generalization Bounds from Optimization

Model-Based Deep Learning: On the Intersection of Deep Learning and Optimization

Demystifying SGD with Doubly Stochastic Gradients