Abstract: Recent years have seen a flurry of activity in designing provably efficient nonconvex procedures for solving statistical estimation problems. Due to the highly nonconvex nature of the empirical loss, state-of-the-art procedures often require proper regularization (e.g., trimming, regularized cost, projection) in order to guarantee fast convergence. For vanilla procedures such as gradient descent, however, prior theory either recommends highly conservative learning rates to avoid overshooting, or completely lacks performance guarantees. This paper uncovers a striking phenomenon in nonconvex optimization: even in the absence of explicit regularization, gradient descent enforces proper regularization implicitly under various statistical models. In fact, gradient descent follows a trajectory staying within a basin that enjoys nice geometry, consisting of points incoherent with the sampling mechanism. This "implicit regularization" feature allows gradient descent to proceed in a far more aggressive fashion without overshooting, which in turn results in substantial computational savings. Focusing on three fundamental statistical estimation problems, i.e., phase retrieval, low-rank matrix completion, and blind deconvolution, we establish that gradient descent achieves near-optimal statistical and computational guarantees without explicit regularization. In particular, by marrying statistical modeling with generic optimization theory, we develop a general recipe for analyzing the trajectories of iterative algorithms via a leave-one-out perturbation argument. As a byproduct, for noisy matrix completion, we demonstrate that gradient descent achieves near-optimal error control, measured both entrywise and by the spectral norm, which might be of independent interest.
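To make "vanilla gradient descent without explicit regularization" concrete, the following is a minimal sketch for the phase retrieval case. It is not the paper's code, and everything specific in it is an assumption made for illustration: a real-valued Gaussian design, noiseless measurements y_i = (a_i^T x*)^2, the quartic least-squares loss f(x) = (1/4m) sum_i ((a_i^T x)^2 - y_i)^2, a spectral initialization, a constant step size of 0.1, and the hypothetical helper names `spectral_init`, `vanilla_gd`, and `sign_invariant_error`. Note that the update contains no trimming, projection, or penalty term.

```python
import numpy as np


def spectral_init(A, y):
    """Leading eigenvector of (1/m) * sum_i y_i a_i a_i^T, rescaled to the estimated signal norm."""
    m = A.shape[0]
    Y = (A.T * y) @ A / m              # n x n weighted sample covariance
    _, vecs = np.linalg.eigh(Y)        # eigenvalues are returned in ascending order
    return np.sqrt(y.mean()) * vecs[:, -1]


def vanilla_gd(A, y, step=0.1, iters=500):
    """Plain gradient descent on f(x) = (1/4m) * sum_i ((a_i^T x)^2 - y_i)^2."""
    m = A.shape[0]
    x = spectral_init(A, y)
    for _ in range(iters):
        Ax = A @ x
        grad = A.T @ ((Ax ** 2 - y) * Ax) / m
        x = x - step * grad            # no trimming, projection, or penalty
    return x


def sign_invariant_error(x, x_star):
    """Euclidean distance up to the global sign ambiguity inherent to phase retrieval."""
    return min(np.linalg.norm(x - x_star), np.linalg.norm(x + x_star))


# Illustrative problem sizes (assumed, not taken from the paper).
rng = np.random.default_rng(0)
n, m = 20, 600
x_star = rng.standard_normal(n)
x_star /= np.linalg.norm(x_star)       # unit-norm ground truth
A = rng.standard_normal((m, n))        # Gaussian sampling vectors a_i as rows
y = (A @ x_star) ** 2                  # noiseless quadratic measurements

x_hat = vanilla_gd(A, y)
err = sign_invariant_error(x_hat, x_star)
print(f"estimation error up to sign: {err:.2e}")
```

In this sketch the step size is a fixed constant rather than a value shrinking polynomially with the dimension, which is the kind of aggressive choice the abstract refers to. The error is measured up to a global sign because x and -x produce identical measurements.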

Implicit Sparse Regularization: The Impact of Depth and Early Stopping

Implicit Regularization in Deep Matrix Factorization

High-Dimensional Linear Regression via Implicit Regularization

Combining Explicit and Implicit Regularization for Efficient Learning in Deep Networks

Implicit Regularization Leads to Benign Overfitting for Sparse Linear Regression

Implicit Regularization in Nonconvex Statistical Estimation: Gradient Descent Converges Linearly for Phase Retrieval, Matrix Completion, and Blind Deconvolution

Implicit Regularization in ReLU Networks with the Square Loss

Robust Implicit Regularization via Weight Normalization

What Happens after SGD Reaches Zero Loss? --A Mathematical Framework

Gradient descent for deep matrix factorization: Dynamics and implicit bias towards low rank

A Dynamics Theory of Implicit Regularization in Deep Low-Rank Matrix Factorization

Towards Resolving the Implicit Bias of Gradient Descent for Matrix Factorization: Greedy Low-Rank Learning

On Regularization via Early Stopping for Least Squares Regression

Understanding Gradient Regularization in Deep Learning: Efficient Finite-Difference Computation and Implicit Bias

The Law of Parsimony in Gradient Descent for Learning Deep Linear Networks

Implicit Regularization of Dropout

A Unified Dynamic Approach to Sparse Model Selection

Deep linear networks for regression are implicitly regularized towards flat minima

DSD$^2$: Can We Dodge Sparse Double Descent and Compress the Neural Network Worry-Free?

Regularization-wise double descent: Why it occurs and how to eliminate it

Mask in the Mirror: Implicit Sparsification