Abstract: Despite an extensive body of literature on deep learning optimization, our current understanding of what makes an optimization algorithm effective is fragmented. In particular, it remains unclear whether better optimization translates to better generalization. Current research overlooks the inherent stochasticity of stochastic gradient descent (SGD) and its variants, which has left their statistical performance without comprehensive benchmarking or insight. This paper addresses that gap. Rather than evaluating only the endpoint of individual optimization trajectories, we draw from an ensemble of trajectories to estimate the stationary distribution of stochastic optimizers. Our investigation covers a wide array of techniques, including SGD and its variants, flat-minima optimizers, and new algorithms we propose under the Basin Hopping (BH) framework. Our evaluation spans synthetic functions with known minima and real-world problems in computer vision and natural language processing, and it emphasizes fair benchmarking under a statistical framework: we compare stationary distributions and establish statistical significance. The study yields several key findings on the relationship between training loss and hold-out accuracy, and shows that SGD, its noise-enabled variants, and the novel optimizers built on the BH framework perform comparably. Notably, these algorithms perform on par with flat-minima optimizers such as SAM while using half the gradient evaluations. We anticipate that our work will catalyze further exploration in deep learning optimization, encouraging a shift away from single-model approaches towards methodologies that acknowledge and leverage the stochastic nature of optimizers.

Beyond Single-Model Views for Deep Learning: Optimization versus Generalizability of Stochastic Optimization Algorithms
