Abstract: Despite an extensive body of literature on deep learning optimization, our current understanding of what makes an optimization algorithm effective is fragmented. In particular, we do not understand well whether better optimization translates to better generalizability. Current research overlooks the inherent stochastic nature of stochastic gradient descent (SGD) and its variants, resulting in a lack of comprehensive benchmarking and insight into their statistical performance. This paper aims to address this gap. Rather than solely evaluating the endpoint of individual optimization trajectories, we draw from an ensemble of trajectories to estimate the stationary distribution of stochastic optimizers. Our investigation covers a wide array of techniques, including SGD and its variants, flat-minima optimizers, and new algorithms we propose under the Basin Hopping (BH) framework. Our evaluation spans synthetic functions with known minima and real-world problems in computer vision and natural language processing. Throughout, we emphasize fair benchmarking under a statistical framework, comparing stationary distributions and establishing statistical significance. Our study uncovers several key findings regarding the relationship between training loss and hold-out accuracy, as well as the comparable performance of SGD, noise-enabled variants, and novel optimizers utilizing the BH framework. Notably, these algorithms demonstrate performance on par with flat-minima optimizers like SAM, albeit with half the gradient evaluations. We anticipate that our work will catalyze further exploration in deep learning optimization, encouraging a shift away from single-model approaches towards methodologies that acknowledge and leverage the stochastic nature of optimizers.
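As a rough illustration of the ensemble idea described in the abstract, the sketch below runs many independent noisy gradient-descent trajectories on a toy function and compares the resulting distributions of final losses with a permutation test. Everything in it is an assumption made for illustration and is not taken from the paper: the 1-D double-well objective, the Gaussian gradient noise standing in for minibatch noise, the helper names (`run_trajectory`, `ensemble`, `permutation_pvalue`), and the choice of a permutation test on mean final loss.

```python
import numpy as np

def loss(x):
    # Hypothetical toy objective: a 1-D double well with two minima of different depth.
    return x**4 - 2.0 * x**2 + 0.3 * x

def grad(x):
    # Analytic gradient of the toy objective.
    return 4.0 * x**3 - 4.0 * x + 0.3

def run_trajectory(lr, noise_std, steps, rng):
    # One noisy gradient-descent run from a random start; returns the final loss.
    x = rng.uniform(-2.0, 2.0)
    for _ in range(steps):
        g = grad(x) + rng.normal(0.0, noise_std)  # Gaussian noise stands in for minibatch noise
        x -= lr * g
    return loss(x)

def ensemble(lr, noise_std, n_runs=200, steps=500, seed=0):
    # Final losses over many independent trajectories: an empirical sample
    # of where the optimizer ends up, rather than a single endpoint.
    rng = np.random.default_rng(seed)
    return np.array([run_trajectory(lr, noise_std, steps, rng) for _ in range(n_runs)])

def permutation_pvalue(a, b, n_perm=2000, seed=0):
    # Two-sided permutation test on the difference of mean final losses.
    rng = np.random.default_rng(seed)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(pooled[:len(a)].mean() - pooled[len(a):].mean())
        count += diff >= observed
    return (count + 1) / (n_perm + 1)

# Compare two noise levels as stand-ins for two stochastic optimizers.
low_noise = ensemble(lr=0.01, noise_std=0.5, seed=1)
high_noise = ensemble(lr=0.01, noise_std=5.0, seed=2)
p = permutation_pvalue(low_noise, high_noise)
print(f"mean final loss: {low_noise.mean():.3f} vs {high_noise.mean():.3f}, p = {p:.4f}")
```

The point of the sketch is only the workflow: sample many endpoints per optimizer, then compare the samples statistically instead of comparing two single runs.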

Adaptive learning rate optimization algorithms with dynamic bound based on Barzilai-Borwein method

Barzilai-Borwein-based Adaptive Learning Rate for Deep Learning

Adaptive Gradient Methods with Dynamic Bound of Learning Rate

An Adaptive and Momental Bound Method for Stochastic Learning

An Automatic Learning Rate Schedule Algorithm for Achieving Faster Convergence and Steeper Descent

AdaXod: a new adaptive and momental bound algorithm for training deep neural networks

AdaDB: an Adaptive Gradient Method with Data-Dependent Bound

ABNGrad: adaptive step size gradient descent for optimizing neural networks

An automatic learning rate decay strategy for stochastic gradient descent optimization methods in neural networks

Adaptive step size rules for stochastic optimization in large-scale learning

Dynamic Batch Adaptation

BiSLS/SPS: Auto-tune Step Sizes for Stable Bi-level Optimization

AdaBB: Adaptive Barzilai-Borwein Method for Convex Optimization

Effective Neural Network Training with a New Weighting Mechanism-Based Optimization Algorithm

BGADAM: Boosting based Genetic-Evolutionary ADAM for Neural Network Optimization

Online Learning for DNN Training: A Stochastic Block Adaptive Gradient Algorithm

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Beyond Single-Model Views for Deep Learning: Optimization versus Generalizability of Stochastic Optimization Algorithms

AdaLip: An Adaptive Learning Rate Method per Layer for Stochastic Optimization

(Rectified Version) The Barzilai-Borwein Method for Distributed Optimization over Unbalanced Directed Networks

Optimization Study of BP Neural Network Based on Genetic Algorithm