Abstract: Optimization is at the heart of machine learning, statistics, and many applied scientific disciplines. It also has a long history in physics, ranging from the minimal action principle to finding ground states of disordered systems such as spin glasses. Proximal algorithms form a class of methods that are broadly applicable and are particularly well-suited to nonsmooth, constrained, large-scale, and distributed optimization problems. There are essentially five proximal algorithms currently known, each proposed in seminal work: forward-backward splitting, Tseng splitting, Douglas-Rachford, the alternating direction method of multipliers, and the more recent Davis-Yin. These methods sit at a higher level of abstraction than gradient-based ones, with deep roots in nonlinear functional analysis. In this paper we show that all of these methods are actually different discretizations of a single differential equation, namely, the simple gradient flow, which dates back to Cauchy (1847). Many of the success stories in machine learning rely on "accelerating" the convergence of first-order methods. However, accelerated methods are notoriously difficult to analyze, counterintuitive, and lack an underlying guiding principle. We show that similar discretization schemes applied to Newton's equation with an additional dissipative force, which we refer to as accelerated gradient flow, allow us to obtain accelerated variants of all these proximal algorithms, the majority of which are new, although some recover known cases in the literature. Furthermore, we extend these methods to stochastic settings, allowing us to make connections with Langevin and Fokker-Planck equations. Similar ideas apply to gradient descent, heavy ball, and Nesterov's method, which are simpler. Our results therefore provide a unified framework in which several important optimization methods are nothing but simulations of classical dissipative systems.
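
As a rough illustration of the discretization viewpoint described in the abstract, the sketch below implements forward-backward splitting for an objective f(x) + g(x). It can be read as a semi-implicit Euler step on the gradient flow: explicit in the smooth term f and implicit in the nonsmooth term g, that is, (x_{k+1} - x_k)/h = -grad f(x_k) - v with v in the subdifferential of g at x_{k+1}. The specific problem (least squares plus an l1 penalty), the function names, and all parameter values are assumptions chosen for illustration. They are not taken from the paper.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||x||_1; this is the implicit (backward Euler) step on g.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def objective(A, b, lam, x):
    # f(x) + g(x) with f = 0.5 * ||Ax - b||^2 and g = lam * ||x||_1 (illustrative choice).
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

def forward_backward(A, b, lam, step, iters=500):
    # Semi-implicit Euler discretization of the gradient flow of f + g.
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)                         # explicit (forward) step on f
        x = soft_threshold(x - step * grad, step * lam)  # implicit (backward) step on g
    return x

# Hypothetical test problem: sparse vector recovered from noiseless linear measurements.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 10))
x_true = np.zeros(10)
x_true[:3] = [1.0, -2.0, 0.5]
b = A @ x_true

lam = 0.1
step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1/L, where L is the Lipschitz constant of grad f
x_hat = forward_backward(A, b, lam, step)
```

Here the step size h plays the role of the time step of the discretization. Choosing h no larger than 1/L is a standard sufficient condition for this iteration to decrease the objective.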

Asymptotic Analysis via Stochastic Differential Equations of Gradient Descent Algorithms in Statistical and Computational Paradigms

Convergence Analysis of Asynchronous Stochastic Recursive Gradient Algorithms

Analysis of Stochastic Gradient Descent in Continuous Time

Differential Equations for Modeling Asynchronous Algorithms

Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations

Stochastic Differential Equations models for Least-Squares Stochastic Gradient Descent

Continuous-time stochastic gradient descent for optimizing over the stationary distribution of stochastic differential equations

Convergence of Constant Step Stochastic Gradient Descent for Non-Smooth Non-Convex Functions

Gradient flows and proximal splitting methods: A unified view on accelerated and stochastic optimization

Stochastic Gradient Descent in Continuous Time: A Central Limit Theorem

Non asymptotic analysis of Adaptive stochastic gradient algorithms and applications

Accelerated stochastic approximation with state-dependent noise

An SDE Perspective on Stochastic Inertial Gradient Dynamics with Time-Dependent Viscosity and Geometric Damping

Accelerated Almost-Sure Convergence Rates for Nonconvex Stochastic Gradient Descent using Stochastic Learning Rates

Stochastic Modified Flows, Mean-Field Limits and Dynamics of Stochastic Gradient Descent

Stationary Behavior of Constant Stepsize SGD Type Algorithms: An Asymptotic Characterization

A Unified Approach to Analyzing Asynchronous Coordinate Descent and Tatonnement

Understanding the unstable convergence of gradient descent

Stochastic Modified Equations and Adaptive Stochastic Gradient Algorithms

High-probability Convergence Bounds for Nonlinear Stochastic Gradient Descent Under Heavy-tailed Noise

The Anytime Convergence of Stochastic Gradient Descent with Momentum: From a Continuous-Time Perspective