Abstract: Understanding generalization of overparametrized neural networks remains a fundamental challenge in machine learning. Most of the literature studies generalization from an interpolation point of view, taking convergence of parameters towards a global minimum of the training loss for granted. While overparametrized architectures indeed interpolate the data for typical classification tasks, this interpolation paradigm does not seem valid anymore for more complex tasks such as in-context learning or diffusion. Instead, for such tasks, it has been empirically observed that the trained models go from global minima to spurious local minima of the training loss as the number of training samples becomes larger than some level we call the optimization threshold. While the former yield poor generalization to the true population loss, the latter were observed to actually correspond to the minimizer of this true loss. This paper theoretically explores this phenomenon in the context of two-layer ReLU networks. We demonstrate that, despite overparametrization, networks often converge toward simpler solutions rather than interpolating the training data, which can lead to a drastic improvement in the test loss with respect to interpolating solutions. Our analysis relies on the so-called early alignment phase, during which neurons align towards specific directions. This directional alignment, which occurs in the early stage of training, leads to a simplicity bias, wherein the network approximates the ground truth model without converging to the global minimum of the training loss. Our results suggest that this bias, resulting in an optimization threshold from which interpolation is not reached anymore, is beneficial and enhances the generalization of trained models.

When Are Bias-Free ReLU Networks Like Linear Networks?

ReLU Neural Networks with Linear Layers are Biased Towards Single- and Multi-Index Models

Expressivity of ReLU-Networks under Convex Relaxations

Neural networks with ReLU powers need less depth

Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data

The Double-Edged Sword of Implicit Bias: Generalization vs. Robustness in ReLU Networks

Understanding Multi-phase Optimization Dynamics and Rich Nonlinear Behaviors of ReLU Networks

Agnostic Learning of Arbitrary ReLU Activation under Gaussian Marginals

Learning Two-Layer ReLU Networks Is Nearly as Easy as Learning Linear Classifiers on Separable Data

Topological Expressivity of ReLU Neural Networks

Stably unactivated neurons in ReLU neural networks

Learning ReLU Networks on Linearly Separable Data: Algorithm, Optimality, and Generalization

Overparameterized ReLU Neural Networks Learn the Simplest Model: Neural Isometry and Phase Transitions

On Theoretical Analysis of Single Hidden Layer Feedforward Neural Networks with Relu Activations

How Implicit Regularization of ReLU Neural Networks Characterizes the Learned Function -- Part I: the 1-D Case of Two Layers with Random First Layer

Multi-Bias Non-linear Activation in Deep Neural Networks

Implicit Regularization in ReLU Networks with the Square Loss

ReLUs Are Sufficient for Learning Implicit Neural Representations

The Evolution of the Interplay Between Input Distributions and Linear Regions in Networks

Simplicity bias and optimization threshold in two-layer ReLU networks

Compelling ReLU Networks to Exhibit Exponentially Many Linear Regions at Initialization and During Training