Abstract: In this work, we reveal a strong implicit bias of stochastic gradient descent (SGD) that drives overly expressive networks to much simpler subnetworks, thereby dramatically reducing the number of independent parameters and improving generalization. To reveal this bias, we identify invariant sets, or subsets of parameter space that remain unmodified by SGD. We focus on two classes of invariant sets that correspond to simpler (sparse or low-rank) subnetworks and commonly appear in modern architectures. Our analysis uncovers that SGD exhibits a property of stochastic attractivity towards these simpler invariant sets. We establish a sufficient condition for stochastic attractivity based on a competition between the loss landscape's curvature around the invariant set and the noise introduced by stochastic gradients. Remarkably, we find that an increased level of noise strengthens attractivity, leading to the emergence of attractive invariant sets associated with saddle points or local maxima of the training loss. We observe empirically the existence of attractive invariant sets in trained deep neural networks, implying that SGD dynamics often collapse to simple subnetworks with either vanishing or redundant neurons. We further demonstrate how this simplifying process of stochastic collapse benefits generalization in a linear teacher-student framework. Finally, through this analysis, we mechanistically explain why early training with large learning rates for extended periods benefits subsequent generalization.
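
The competition between curvature and gradient noise described in the abstract can be illustrated with a one-dimensional toy model. This is a minimal sketch under stated assumptions, not the paper's actual condition or experiments: the invariant set is the point theta = 0, the loss has constant curvature h there (h < 0 makes it a local maximum), and the stochastic gradient is (h + sigma * xi) * theta with xi drawn uniformly from {-1, +1}. The names lyapunov_exponent, simulate, eta, h, and sigma are hypothetical and chosen only for illustration. In this toy, theta = 0 is attractive when the average log growth rate of |theta| per step is negative.

```python
import math
import random


def lyapunov_exponent(eta, h, sigma):
    """Average per-step log growth rate of |theta| in the toy model.

    Assumed update (hypothetical toy, not the paper's setting):
        theta <- theta - eta * (h + sigma * xi) * theta,  xi in {-1, +1}.
    A negative value means theta = 0 attracts the dynamics.
    """
    up = abs(1.0 - eta * (h + sigma))    # multiplier when xi = +1
    down = abs(1.0 - eta * (h - sigma))  # multiplier when xi = -1
    return 0.5 * (math.log(up) + math.log(down))


def simulate(eta, h, sigma, steps=200, theta0=1.0, seed=0):
    """Run the toy stochastic update and return the final theta."""
    rng = random.Random(seed)
    theta = theta0
    for _ in range(steps):
        xi = rng.choice((-1.0, 1.0))  # Rademacher gradient noise
        theta -= eta * (h + sigma * xi) * theta
    return theta


if __name__ == "__main__":
    eta, h = 0.5, -0.2  # h < 0: theta = 0 is a local maximum of the loss
    for sigma in (0.0, 2.0):
        rate = lyapunov_exponent(eta, h, sigma)
        final = simulate(eta, h, sigma)
        print(f"sigma={sigma}: growth rate={rate:+.3f}, |theta_final|={abs(final):.3e}")
```

Without noise (sigma = 0) the iterate escapes the local maximum, as plain gradient descent would. With sufficiently large noise the growth rate turns negative and the iterate collapses onto theta = 0, which mirrors, in this simplified setting only, the claim that stronger noise strengthens attractivity.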

Towards Better Generalization of Deep Neural Networks via Non-Typicality Sampling Scheme

Accelerating Minibatch Stochastic Gradient Descent Using Typicality Sampling

"Oddball SGD": Novelty Driven Stochastic Gradient Descent for Training Deep Neural Networks

Accelerating Stochastic Gradient Descent Using Antithetic Sampling

Importance Sampling for Stochastic Gradient Descent in Deep Neural Networks

Batch Normalization Sampling

Demystifying SGD with Doubly Stochastic Gradients

Make the Most of Your Data: Changing the Training Data Distribution to Improve In-distribution Generalization Performance

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Stability and Generalization for Minibatch SGD and Local SGD

Towards Better Generalization: BP-SVRG in Training Deep Neural Networks

Generalization Analysis of Stochastic Weight Averaging with General Sampling

Can we learn better with hard samples?

Dynamic of Stochastic Gradient Descent with State-Dependent Noise

Noisy Truncated SGD: Optimization and Generalization

Adaptive Sampling for Deep Learning via Efficient Nonparametric Proxies

Uniform Learning in a Deep Neural Network via "Oddball" Stochastic Gradient Descent

Aiming towards the minimizers: fast convergence of SGD for overparametrized problems

Stochastic Collapse: How Gradient Noise Attracts SGD Dynamics Towards Simpler Subnetworks

Non-asymptotic Analysis of Biased Adaptive Stochastic Approximation

N-SVRG: Stochastic Variance Reduction Gradient with Noise Reduction Ability for Small Batch Samples