Abstract: Understanding generalization of overparametrized neural networks remains a fundamental challenge in machine learning. Most of the literature studies generalization from an interpolation point of view, taking convergence of parameters towards a global minimum of the training loss for granted. While overparametrized architectures indeed interpolate the data for typical classification tasks, this interpolation paradigm no longer seems valid for more complex tasks such as in-context learning or diffusion. Instead, for such tasks, it has been empirically observed that trained models go from global minima to spurious local minima of the training loss as the number of training samples becomes larger than some level, which we call the optimization threshold. While the former yield poor generalization to the true population loss, the latter were observed to actually correspond to the minimizer of this true loss. This paper theoretically explores this phenomenon in the context of two-layer ReLU networks. We demonstrate that, despite overparametrization, networks often converge toward simpler solutions rather than interpolating the training data, which can lead to a drastic improvement in the test loss with respect to interpolating solutions. Our analysis relies on the so-called early alignment phase, during which neurons align towards specific directions. This directional alignment, which occurs in the early stage of training, leads to a simplicity bias, wherein the network approximates the ground truth model without converging to the global minimum of the training loss. Our results suggest that this bias, which results in an optimization threshold beyond which interpolation is no longer reached, is beneficial and enhances the generalization of trained models.

Learning Two-Layer ReLU Networks Is Nearly as Easy as Learning Linear Classifiers on Separable Data

Benign Overfitting for Regression with Trained Two-Layer ReLU Networks

Convergence Analysis of Two-layer Neural Networks with ReLU Activation

Early Neuron Alignment in Two-layer ReLU Networks with Small Initialization

Benign Overfitting for Two-layer ReLU Convolutional Neural Networks

How Implicit Regularization of ReLU Neural Networks Characterizes the Learned Function -- Part I: the 1-D Case of Two Layers with Random First Layer

Training shallow ReLU networks on noisy data using hinge loss: when do we overfit and is it benign?

Training a Two Layer ReLU Network Analytically

Understanding Multi-phase Optimization Dynamics and Rich Nonlinear Behaviors of ReLU Networks

Gradient Descent can Learn Less Over-parameterized Two-layer Neural Networks on Classification Problems

Learning Narrow One-Hidden-Layer ReLU Networks

Convex Formulations for Training Two-Layer ReLU Neural Networks

On Learnability via Gradient Method for Two-Layer ReLU Neural Networks in Teacher-Student Setting

EraseReLU: A Simple Way to Ease the Training of Deep Convolution Neural Networks

Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data

Interpretable global minima of deep ReLU neural networks on sequentially separable data

On the Principles of ReLU Networks with One Hidden Layer

Simplicity bias and optimization threshold in two-layer ReLU networks

Compelling ReLU Networks to Exhibit Exponentially Many Linear Regions at Initialization and During Training

Two-layer Networks with the ReLU^k Activation Function: Barron Spaces and Derivative Approximation

Stable Minima Cannot Overfit in Univariate ReLU Networks: Generalization by Large Step Sizes