Abstract: Despite an extensive body of literature on deep learning optimization, our current understanding of what makes an optimization algorithm effective is fragmented. In particular, it remains unclear whether better optimization translates into better generalizability. Current research overlooks the inherent stochasticity of stochastic gradient descent (SGD) and its variants, which has led to a lack of comprehensive benchmarking and limited insight into their statistical performance. This paper addresses that gap. Rather than evaluating only the endpoint of individual optimization trajectories, we draw from an ensemble of trajectories to estimate the stationary distribution of stochastic optimizers. Our investigation covers a wide array of techniques, including SGD and its variants, flat-minima optimizers, and new algorithms we propose under the Basin Hopping (BH) framework. Our evaluation spans synthetic functions with known minima and real-world problems in computer vision and natural language processing, and it emphasizes fair benchmarking under a statistical framework: we compare stationary distributions and establish statistical significance. Our study uncovers several key findings regarding the relationship between training loss and hold-out accuracy, as well as the comparable performance of SGD, noise-enabled variants, and novel optimizers built on the BH framework. Notably, these algorithms perform on par with flat-minima optimizers such as SAM, albeit with half the gradient evaluations. We anticipate that our work will catalyze further exploration in deep learning optimization, encouraging a shift away from single-model approaches towards methodologies that acknowledge and leverage the stochastic nature of optimizers.
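The ensemble-based evaluation described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation. It assumes a one-dimensional double-well loss with known minima, a noisy gradient-descent update as a stand-in for a stochastic optimizer, and a permutation test as one possible way to establish statistical significance (the abstract does not say which test is used). All names (`loss`, `run_trajectory`, `ensemble`, `permutation_test`) and all hyperparameters are hypothetical.

```python
import random

def loss(x):
    # Assumed toy objective: a double well with minima near x = -1 and x = +1.
    return x**4 - 2 * x**2 + 0.3 * x

def grad(x):
    # Exact derivative of the toy objective above.
    return 4 * x**3 - 4 * x + 0.3

def run_trajectory(seed, noise_std, lr=0.01, steps=2000):
    """Run one noisy gradient-descent trajectory and return its endpoint loss."""
    rng = random.Random(seed)
    x = rng.uniform(-2.0, 2.0)  # random initialization
    for _ in range(steps):
        g = grad(x) + rng.gauss(0.0, noise_std)  # gradient plus Gaussian noise
        x -= lr * g
    return loss(x)

def ensemble(noise_std, n_runs=50, seed_offset=0):
    """Collect endpoint losses from many independent trajectories."""
    return [run_trajectory(seed_offset + s, noise_std) for s in range(n_runs)]

def mean(values):
    return sum(values) / len(values)

def permutation_test(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of sample means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)

# Compare two hypothetical optimizer settings by their endpoint-loss samples.
low_noise = ensemble(noise_std=0.5)
high_noise = ensemble(noise_std=5.0, seed_offset=1000)
p_value = permutation_test(low_noise, high_noise)
print(f"mean loss (low noise):  {mean(low_noise):.4f}")
print(f"mean loss (high noise): {mean(high_noise):.4f}")
print(f"permutation-test p-value: {p_value:.4f}")
```

The point of the sketch is the shape of the comparison: each optimizer is represented by a sample of outcomes rather than a single run, and the two samples are compared with a significance test instead of by their best or last value.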

A Comparison of Optimization Algorithms for Deep Learning

A comparative study of recently deep learning optimizers

Optimization Methods in Deep Learning: A Comprehensive Overview

An Efficient Optimization Technique for Training Deep Neural Networks

Research on Optimization of Image Recognition Algorithm Based on Deep Learning

Optimization of deep learning models: benchmark and analysis

Assessment of Optimizers impact on Image Recognition with Convolutional Neural Network to Adversarial Datasets

Effectiveness of Optimization Algorithms in Deep Image Classification

Comparative Investigation of Learning Algorithms for Image Classification with Small Dataset

Optimization for deep learning: theory and algorithms

On Empirical Comparisons of Optimizers for Deep Learning

Empirical Tests of Optimization Assumptions in Deep Learning

Computational issues in Optimization for Deep networks

Beyond Single-Model Views for Deep Learning: Optimization versus Generalizability of Stochastic Optimization Algorithms

Optimization Algorithm Inspired Deep Neural Network Structure Design

Effective Neural Network Training with a New Weighting Mechanism-Based Optimization Algorithm.

Convergence of Stochastic Gradient Descent in Deep Neural Network

Enhancing Deep Learning with Optimized Gradient Descent: Bridging Numerical Methods and Neural Network Training

Adaptation of nature inspired optimization algorithms for deep learning

Advanced metaheuristic optimization techniques in applications of deep neural networks: a review

In-Depth Case Study on Artificial Neural Network Weights Optimization Using Meta-Heuristic and Heuristic Algorithmic Approach