Abstract: Despite an extensive body of literature on deep learning optimization, our current understanding of what makes an optimization algorithm effective is fragmented. In particular, we do not understand well whether enhanced optimization translates to improved generalizability. Current research overlooks the inherent stochastic nature of stochastic gradient descent (SGD) and its variants, resulting in a lack of comprehensive benchmarking and insight into their statistical performance. This paper addresses this gap: rather than solely evaluating the endpoint of individual optimization trajectories, we draw from an ensemble of trajectories to estimate the stationary distribution of stochastic optimizers. Our investigation covers a wide array of techniques, including SGD and its variants, flat-minima optimizers, and new algorithms we propose under the Basin Hopping (BH) framework. Our evaluation spans synthetic functions with known minima and real-world problems in computer vision and natural language processing, and emphasizes fair benchmarking under a statistical framework, comparing stationary distributions and establishing statistical significance. Our study uncovers several key findings regarding the relationship between training loss and hold-out accuracy, as well as the comparable performance of SGD, noise-enabled variants, and novel optimizers utilizing the BH framework. Notably, these algorithms demonstrate performance on par with flat-minima optimizers like SAM while requiring half the gradient evaluations. We anticipate that our work will catalyze further exploration in deep learning optimization, encouraging a shift away from single-model approaches towards methodologies that acknowledge and leverage the stochastic nature of optimizers.
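The ensemble idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's benchmark: the double-well test function, the noise model, the hyperparameters, the clipping safeguard, and the permutation test are all assumptions chosen for illustration, and the names `run_trajectory` and `permutation_p_value` are hypothetical. The sketch runs many independently seeded trajectories per optimizer, collects the final losses as samples from each optimizer's end-of-training distribution, and tests whether the two samples differ in mean.

```python
import random
import statistics


def loss(x):
    # Illustrative double-well function with two minima of different depth.
    return x**4 - 2.0 * x**2 + 0.3 * x


def grad(x):
    # Exact derivative of the loss above.
    return 4.0 * x**3 - 4.0 * x + 0.3


def run_trajectory(seed, lr, momentum, noise=0.5, steps=500):
    """Run one noisy gradient trajectory and return its final loss."""
    rng = random.Random(seed)
    x = rng.uniform(-2.0, 2.0)  # random initialization
    v = 0.0
    for _ in range(steps):
        # Gaussian noise on the gradient stands in for minibatch noise (assumption).
        g = grad(x) + rng.gauss(0.0, noise)
        v = momentum * v - lr * g
        # Clip to [-3, 3] as a safeguard against divergence (assumption).
        x = max(-3.0, min(3.0, x + v))
    return loss(x)


def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of sample means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:len(a)]) - statistics.fmean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# One sample of final losses per optimizer, drawn from 50 seeded trajectories each.
sgd_losses = [run_trajectory(s, lr=0.05, momentum=0.0) for s in range(50)]
mom_losses = [run_trajectory(s, lr=0.01, momentum=0.9) for s in range(50)]
p = permutation_p_value(sgd_losses, mom_losses)

print("plain SGD    mean final loss:", statistics.fmean(sgd_losses))
print("SGD+momentum mean final loss:", statistics.fmean(mom_losses))
print("permutation-test p-value:", p)
```

Comparing whole samples rather than a single run is the point of the sketch: a lone trajectory may land in either well, so one endpoint says little about how an optimizer behaves in distribution.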

GSdyn: Learning training dynamics via online Gaussian optimization with gradient states

A hybrid training algorithm based on gradient descent and evolutionary computation

Online hyperparameter optimization by real-time recurrent learning

Learning Gradient Descent: Better Generalization and Longer Horizons

Ensemble Kalman Filtering for Online Gaussian Process Regression and Learning

Non-convex Bayesian Learning via Stochastic Gradient Markov Chain Monte Carlo

Gradient Methods with Online Scaling

Practical Bayesian Optimization of Machine Learning Algorithms

Adaptive Gradient Methods with Dynamic Bound of Learning Rate

Recursive Gaussian Process State Space Model

Beyond Single-Model Views for Deep Learning: Optimization versus Generalizability of Stochastic Optimization Algorithms

Universal Online Learning with Gradient Variations: A Multi-layer Online Ensemble Approach

Cost-Efficient Online Hyperparameter Optimization

A Randomized Block-Coordinate Adam online learning optimization algorithm

Pre-training the Deep Generative Models with Adaptive Hyperparameter Optimization

Bayesian Optimization for Policy Search via Online-Offline Experimentation

An Invariant Information Geometric Method for High-Dimensional Online Optimization

Towards Hyperparameter-Agnostic DNN Training via Dynamical System Insights

Event-Based Control for Online Training of Neural Networks

A Data-Driven Evolutionary Transfer Optimization for Expensive Problems in Dynamic Environments

GOALS: Gradient-Only Approximations for Line Searches Towards Robust and Consistent Training of Deep Neural Networks