Abstract: Despite an extensive body of literature on deep learning optimization, our current understanding of what makes an optimization algorithm effective is fragmented. In particular, it remains unclear whether better optimization translates to better generalizability. Current research overlooks the inherent stochastic nature of stochastic gradient descent (SGD) and its variants, resulting in a lack of comprehensive benchmarking and insight into their statistical performance. This paper addresses this gap. Rather than evaluating only the endpoint of individual optimization trajectories, we draw from an ensemble of trajectories to estimate the stationary distribution of stochastic optimizers. Our investigation covers a wide array of techniques, including SGD and its variants, flat-minima optimizers, and new algorithms we propose under the Basin Hopping (BH) framework. Our evaluation spans synthetic functions with known minima and real-world problems in computer vision and natural language processing, and it emphasizes fair benchmarking under a statistical framework: we compare stationary distributions and establish statistical significance. Our study uncovers several key findings regarding the relationship between training loss and hold-out accuracy, as well as the comparable performance of SGD, noise-enabled variants, and novel optimizers built on the BH framework. Notably, these algorithms perform on par with flat-minima optimizers like SAM, albeit with half the gradient evaluations. We anticipate that our work will catalyze further exploration in deep learning optimization, encouraging a shift away from single-model approaches towards methodologies that acknowledge and leverage the stochastic nature of optimizers.
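
The ensemble-based evaluation described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the one-dimensional double-well objective, the Gaussian noise standing in for minibatch noise, the hypothetical helper names (`noisy_sgd`, `ensemble`, `permutation_test`), and the choice of a permutation test as the significance test. The idea is only to show the shape of the workflow: run many trajectories per optimizer, keep post-burn-in losses as samples from an approximate stationary distribution, and compare optimizers statistically rather than by a single endpoint.

```python
import random
import statistics


def loss(x):
    # Synthetic double-well objective with known minima (illustrative choice).
    return (x * x - 1.0) ** 2 + 0.3 * x


def grad(x):
    # Analytic derivative of the objective above.
    return 4.0 * x * (x * x - 1.0) + 0.3


def noisy_sgd(rng, lr, momentum, steps=500, burn_in=300, noise=1.0):
    """Run one noisy trajectory and return the losses visited after burn-in."""
    x = rng.uniform(-2.0, 2.0)  # random initialization
    v = 0.0
    samples = []
    for t in range(steps):
        # Assumption: additive Gaussian noise stands in for minibatch noise.
        g = grad(x) + rng.gauss(0.0, noise)
        v = momentum * v - lr * g
        x += v
        if t >= burn_in:
            samples.append(loss(x))
    return samples


def ensemble(seed, n_traj, lr, momentum):
    """Mean post-burn-in loss for each of n_traj independent trajectories."""
    rng = random.Random(seed)
    return [statistics.fmean(noisy_sgd(rng, lr, momentum)) for _ in range(n_traj)]


def permutation_test(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of means; returns a p-value."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:len(a)]) - statistics.fmean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    plain = ensemble(seed=1, n_traj=30, lr=0.01, momentum=0.0)
    heavy = ensemble(seed=2, n_traj=30, lr=0.01, momentum=0.5)
    print("plain SGD mean loss:", statistics.fmean(plain))
    print("momentum  mean loss:", statistics.fmean(heavy))
    print("permutation p-value:", permutation_test(plain, heavy))
```

A large p-value here would suggest that the two optimizers are statistically indistinguishable on this toy objective, which is the kind of conclusion a single-trajectory comparison cannot support.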

State Space Representation and Phase Analysis of Gradient Descent Optimizers

Accelerated Optimization in Deep Learning with a Proportional-Integral-Derivative Controller

Enhancing Deep Learning with Optimized Gradient Descent: Bridging Numerical Methods and Neural Network Training

Dynamic of Stochastic Gradient Descent with State-Dependent Noise

System response curve based first‐order optimization algorithms for cyber‐physical‐social intelligence

Gradient Information Matters in Policy Optimization by Back-propagating through Model

Efficient and stable SAV-based methods for gradient flows arising from deep learning

Near-optimal control of dynamical systems with neural ordinary differential equations

A Control Theoretic Framework for Adaptive Gradient Optimizers in Machine Learning

Learning to be Global Optimizer

Learning Gradient Descent: Better Generalization and Longer Horizons

A Dynamical View on Optimization Algorithms of Overparameterized Neural Networks

On Empirical Comparisons of Optimizers for Deep Learning

Variational Stochastic Gradient Descent for Deep Neural Networks

PID Controller-Based Stochastic Optimization Acceleration for Deep Neural Networks

Gradient Descent based Optimization Algorithms for Deep Learning Models Training

Beyond Single-Model Views for Deep Learning: Optimization versus Generalizability of Stochastic Optimization Algorithms

Understanding Optimization of Deep Learning via Jacobian Matrix and Lipschitz Constant

Towards Hyperparameter-Agnostic DNN Training via Dynamical System Insights

Optimal Adaptive and Accelerated Stochastic Gradient Descent

Gradient Descent Optimization in Deep Learning Model Training Based on Multistage and Method Combination Strategy