Abstract: The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting. We survey recent theoretical progress that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behavior of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favorable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.
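
The implicit regularization mentioned in the abstract can be illustrated in the simplest linear case. The following is a minimal sketch, not taken from the paper: it assumes a toy least-squares problem with two samples and three parameters, and the data, step size, and function names are all made up for illustration. Gradient descent started from zero stays in the row space of the data matrix, so among the infinitely many interpolating solutions it converges to the one of minimal Euclidean norm.

```python
# Toy illustration (assumed data, hypothetical helper names): gradient descent
# on an overparametrized least-squares problem, started from w = 0, converges
# to the minimum-norm solution that interpolates the training data.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(w):
    return dot(w, w) ** 0.5

def gradient_descent(X, y, step=0.05, iters=2000):
    """Minimize 0.5 * ||X w - y||^2 by plain gradient descent from w = 0."""
    w = [0.0] * len(X[0])
    for _ in range(iters):
        residual = [dot(row, w) - t for row, t in zip(X, y)]
        # Gradient is X^T (X w - y).
        grad = [sum(r * row[j] for r, row in zip(residual, X))
                for j in range(len(w))]
        w = [wi - step * gi for wi, gi in zip(w, grad)]
    return w

def min_norm_interpolant(X, y):
    """Closed form w = X^T (X X^T)^{-1} y, written out for exactly two samples."""
    (x0, x1), (y0, y1) = X, y
    g00, g01, g11 = dot(x0, x0), dot(x0, x1), dot(x1, x1)
    det = g00 * g11 - g01 * g01
    a0 = (g11 * y0 - g01 * y1) / det
    a1 = (g00 * y1 - g01 * y0) / det
    return [a0 * u + a1 * v for u, v in zip(x0, x1)]

# Two samples, three parameters: more parameters than data points.
X = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 3.0]]
y = [1.0, 2.0]

w_gd = gradient_descent(X, y)
w_mn = min_norm_interpolant(X, y)

# Another interpolating solution: add a vector from the null space of X.
null_vec = [6.0, -3.0, 1.0]
w_other = [a + b for a, b in zip(w_mn, null_vec)]

print("gradient descent:", w_gd)
print("minimum norm    :", w_mn)
print("norms:", norm(w_gd), "vs another interpolant:", norm(w_other))
```

Both `w_gd` and `w_other` fit the two training points exactly, but only the former has minimal norm. The choice of a zero initialization matters here: starting elsewhere would leave a component outside the row space of `X` that gradient descent never removes.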

A Closer Look at Deep Policy Gradients

The Definitive Guide to Policy Gradients in Deep Reinforcement Learning: Theory, Algorithms and Implementations

Revisiting Estimation Bias in Policy Gradients for Deep Reinforcement Learning

Fractal Landscapes in Policy Optimization

Identifying Policy Gradient Subspaces

Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning

Gradient Information Matters in Policy Optimization by Back-propagating through Model

Deep learning: a statistical viewpoint

Policy Gradient Algorithms Implicitly Optimize by Continuation

Elementary Analysis of Policy Gradient Methods

Behind the Myth of Exploration in Policy Gradients

Mollification Effects of Policy Gradient Methods

Deep deterministic policy gradient algorithm: A systematic review

Understanding and Diagnosing Deep Reinforcement Learning

Beyond Single-Model Views for Deep Learning: Optimization versus Generalizability of Stochastic Optimization Algorithms

Is the Policy Gradient a Gradient?

Deep Policy Gradient Methods Without Batch Updates, Target Networks, or Replay Buffers

On the Linear Convergence of Natural Policy Gradient Algorithm

Deep Metric Tensor Regularized Policy Gradient

Deep Learning Through A Telescoping Lens: A Simple Model Provides Empirical Insights On Grokking, Gradient Boosting & Beyond