Abstract: Recent years have seen a flurry of activity in designing provably efficient nonconvex procedures for solving statistical estimation problems. Due to the highly nonconvex nature of the empirical loss, state-of-the-art procedures often require proper regularization (e.g., trimming, regularized cost, projection) in order to guarantee fast convergence. For vanilla procedures such as gradient descent, however, prior theory either recommends highly conservative learning rates to avoid overshooting, or completely lacks performance guarantees. This paper uncovers a striking phenomenon in nonconvex optimization: even in the absence of explicit regularization, gradient descent enforces proper regularization implicitly under various statistical models. In fact, gradient descent follows a trajectory staying within a basin that enjoys nice geometry, consisting of points incoherent with the sampling mechanism. This "implicit regularization" feature allows gradient descent to proceed in a far more aggressive fashion without overshooting, which in turn results in substantial computational savings. Focusing on three fundamental statistical estimation problems, i.e., phase retrieval, low-rank matrix completion, and blind deconvolution, we establish that gradient descent achieves near-optimal statistical and computational guarantees without explicit regularization. In particular, by marrying statistical modeling with generic optimization theory, we develop a general recipe for analyzing the trajectories of iterative algorithms via a leave-one-out perturbation argument. As a byproduct, for noisy matrix completion, we demonstrate that gradient descent achieves near-optimal error control (measured entrywise and by the spectral norm), which might be of independent interest.

The Convex Geometry of Backpropagation: Neural Network Gradient Flows Converge to Extreme Points of the Dual Convex Program

Asymptotic Convergence for a Class of Fully Nonlinear Curvature Flows

A Geometric Approach of Gradient Descent Algorithms in Linear Neural Networks

Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data

Understanding the training of infinitely deep and wide ResNets with Conditional Optimal Transport

Regularization properties of dual subgradient flow

Benign Overfitting for Regression with Trained Two-Layer ReLU Networks

Normalized gradient flow optimization in the training of ReLU artificial neural networks

The Convex Geometry of Network Flows

Convergence Analysis of Two-layer Neural Networks with ReLU Activation

Abide by the Law and Follow the Flow: Conservation Laws for Gradient Flows

On Learnability via Gradient Method for Two-Layer ReLU Neural Networks in Teacher-Student Setting

A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks

Concavifiability and convergence: necessary and sufficient conditions for gradient descent analysis

On Convergence of Training Loss Without Reaching Stationary Points

Gradient Descent Provably Escapes Saddle Points in the Training of Shallow ReLU Networks

Implicit Regularization in Nonconvex Statistical Estimation: Gradient Descent Converges Linearly for Phase Retrieval, Matrix Completion, and Blind Deconvolution

How to induce regularization in linear models: A guide to reparametrizing gradient flow

Hyperbolic Gradient Flow: Evolution of Graphs in R^{n+1}

Gradient descent for unbounded convex functions on Hadamard manifolds and its applications to scaling problems