Abstract: In this paper we provide a novel analytical perspective on the theoretical understanding of gradient-based learning algorithms by interpreting consensus-based optimization (CBO), a recently proposed multi-particle derivative-free optimization method, as a stochastic relaxation of gradient descent. Remarkably, we observe that through communication of the particles, CBO exhibits a stochastic gradient descent (SGD)-like behavior despite solely relying on evaluations of the objective function. The fundamental value of such a link between CBO and SGD lies in the fact that CBO is provably globally convergent to global minimizers for ample classes of nonsmooth and nonconvex objective functions, which, on the one hand, offers a novel explanation for the success of stochastic relaxations of gradient descent. On the other hand, contrary to the conventional wisdom according to which zero-order methods ought to be inefficient or to lack generalization abilities, our results unveil an intrinsic gradient descent nature of such heuristics. This viewpoint furthermore complements previous insights into the working principles of CBO, which describe the dynamics in the mean-field limit through a nonlinear nonlocal partial differential equation that allows one to alleviate the complexities of the nonconvex function landscape. Our proofs leverage a completely nonsmooth analysis, which combines a novel quantitative version of the Laplace principle (log-sum-exp trick) and the minimizing movement scheme (proximal iteration). In doing so, we furnish useful and precise insights that explain how stochastic perturbations of gradient descent overcome energy barriers and reach deep levels of nonconvex functions. Instructive numerical illustrations support the provided theoretical insights.
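As a rough illustration of the mechanism the abstract refers to, the following minimal sketch computes a Gibbs-weighted consensus point with the log-sum-exp shift and applies one drift step toward it. It is not the authors' implementation: the function names `consensus_point` and `cbo_drift_step` and the parameters `alpha`, `lam`, and `dt` are hypothetical, and the stochastic noise term of CBO is deliberately omitted so that the example stays deterministic.

```python
import math


def consensus_point(particles, objective, alpha):
    """Gibbs-weighted average of particle positions (hypothetical helper).

    Only objective evaluations are used, no gradients.
    """
    values = [objective(x) for x in particles]
    # Log-sum-exp shift: subtract the minimum so the largest weight is exp(0) = 1,
    # which avoids underflow of all weights when alpha is large.
    f_min = min(values)
    weights = [math.exp(-alpha * (v - f_min)) for v in values]
    total = sum(weights)
    dim = len(particles[0])
    return [
        sum(w * x[d] for w, x in zip(weights, particles)) / total
        for d in range(dim)
    ]


def cbo_drift_step(particles, objective, alpha, lam, dt):
    """One noise-free drift step: every particle moves toward the consensus point.

    Assumption: the diffusion (noise) term of CBO is left out of this sketch.
    """
    c = consensus_point(particles, objective, alpha)
    return [
        [xi - lam * dt * (xi - ci) for xi, ci in zip(x, c)]
        for x in particles
    ]


if __name__ == "__main__":
    # Toy one-dimensional objective with its minimizer at x = 1.
    def f(x):
        return (x[0] - 1.0) ** 2

    pts = [[0.0], [1.0], [3.0]]
    print(consensus_point(pts, f, alpha=0.0))   # plain average of the particles
    print(consensus_point(pts, f, alpha=50.0))  # concentrates near the best particle
    print(cbo_drift_step(pts, f, alpha=50.0, lam=1.0, dt=0.5))
```

With `alpha = 0` all particles are weighted equally, while for large `alpha` the weights concentrate on the particle with the smallest objective value, which is the behavior the Laplace principle describes.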

Optimistic Meta-Gradients

Meta-Gradients in Non-Stationary Environments

Adaptive Gradient-Based Meta-Learning Methods

EvoGrad: Efficient Gradient-Based Meta-Learning and Hyperparameter Optimization

Bootstrapped Meta-Learning

Towards Understanding Generalization in Gradient-Based Meta-Learning

A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning

MADA: Meta-Adaptive Optimizers through hyper-gradient Descent

On the Convergence Theory of Gradient-Based Model-Agnostic Meta-Learning Algorithms

Parallel Momentum Methods Under Biased Gradient Estimations

Meta-Learning with Warped Gradient Descent

A History of Meta-gradient: Gradient Methods for Meta-learning

Gradient Descent: The Ultimate Optimizer

A Theoretical Understanding of Gradient Bias in Meta-Reinforcement Learning

Curriculum in Gradient-Based Meta-Reinforcement Learning

Fast Adaptation with Kernel and Gradient based Meta Leaning

Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization

Accelerated Gradient-free Neural Network Training by Multi-convex Alternating Optimization

Adversarial gradient-based meta learning with metric-based test

Gradient is All You Need?

Flatter, faster: scaling momentum for optimal speedup of SGD