Abstract: It is well known that neural networks with many more parameters than training examples do not overfit. Implicit regularization phenomena, which are still not well understood, occur during optimization and 'good' networks are favored. Thus the number of parameters is not an adequate measure of complexity if we do not consider all possible networks but only the 'good' ones. To better understand which networks are favored during optimization, we study the geometry of the output set as parameters vary. When the inputs are fixed, we prove that the dimension of this set changes and that the local dimension, called batch functional dimension, is almost surely determined by the activation patterns in the hidden layers. We prove that the batch functional dimension is invariant to the symmetries of the network parameterization: neuron permutations and positive rescalings. Empirically, we establish that the batch functional dimension decreases during optimization. As a consequence, optimization leads to parameters with low batch functional dimensions. We call this phenomenon geometry-induced implicit regularization.

The batch functional dimension depends on both the network parameters and inputs. To understand the impact of the inputs, we study, for fixed parameters, the largest attainable batch functional dimension when the inputs vary. We prove that this quantity, called computable full functional dimension, is also invariant to the symmetries of the network's parameterization, and is determined by the achievable activation patterns. We also provide a sampling theorem, showing a fast convergence of the estimation of the computable full functional dimension for a random input of increasing size. Empirically, we find that the computable full functional dimension remains close to the number of parameters, which is related to the notion of local identifiability. This differs from the observed values for the batch functional dimension computed on training inputs and test inputs. The latter are influenced by geometry-induced implicit regularization.
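
To make the central quantity concrete, here is a minimal sketch of how a batch functional dimension could be estimated numerically. It rests on assumptions not stated in the abstract: the local dimension is taken to be the rank of the Jacobian of the map from parameters to the stacked outputs on a fixed input batch, the network is a toy one-hidden-layer ReLU network, and the function names, sizes, and tolerance are hypothetical choices for illustration. The sketch also checks the two symmetries named in the abstract (a positive rescaling of one hidden neuron and a permutation of the hidden neurons).

```python
import numpy as np

def batch_jacobian(W1, b1, W2, b2, X):
    """Jacobian of the stacked outputs on the batch X w.r.t. all parameters.

    Toy network (an assumption for illustration): y = W2 @ relu(W1 @ x + b1) + b2.
    Rows index (sample i, output k); columns index W1, b1, W2, b2 flattened.
    """
    n, d = X.shape
    h, o = W1.shape[0], W2.shape[0]
    z = X @ W1.T + b1               # pre-activations, shape (n, h)
    mask = (z > 0).astype(float)    # activation pattern of the hidden layer
    a = z * mask                    # ReLU outputs

    # d y[i,k] / d W1[j,l] = W2[k,j] * mask[i,j] * X[i,l]
    dW1 = np.einsum("kj,ij,il->ikjl", W2, mask, X).reshape(n * o, h * d)
    # d y[i,k] / d b1[j] = W2[k,j] * mask[i,j]
    db1 = np.einsum("kj,ij->ikj", W2, mask).reshape(n * o, h)
    # d y[i,k] / d W2[m,j] = delta(k,m) * a[i,j]
    dW2 = np.einsum("km,ij->ikmj", np.eye(o), a).reshape(n * o, o * h)
    # d y[i,k] / d b2[m] = delta(k,m)
    db2 = np.tile(np.eye(o), (n, 1))
    return np.hstack([dW1, db1, dW2, db2])

def batch_functional_dimension(W1, b1, W2, b2, X, rel_tol=1e-9):
    """Numerical rank of the batch Jacobian (rel_tol is a hypothetical threshold)."""
    s = np.linalg.svd(batch_jacobian(W1, b1, W2, b2, X), compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

rng = np.random.default_rng(0)
d, h, o, n = 2, 3, 1, 20            # illustrative sizes, not from the paper
W1, b1 = rng.normal(size=(h, d)), rng.normal(size=h)
W2, b2 = rng.normal(size=(o, h)), rng.normal(size=o)
X = rng.normal(size=(n, d))

rank = batch_functional_dimension(W1, b1, W2, b2, X)

# Positive rescaling of hidden neuron 0: scale its incoming weights and bias
# by lam and its outgoing weights by 1/lam; the network function is unchanged.
lam = 3.7
W1s, b1s, W2s = W1.copy(), b1.copy(), W2.copy()
W1s[0] *= lam
b1s[0] *= lam
W2s[:, 0] /= lam
rank_rescaled = batch_functional_dimension(W1s, b1s, W2s, b2, X)

# Permutation of the hidden neurons.
perm = np.array([2, 0, 1])
rank_permuted = batch_functional_dimension(W1[perm], b1[perm], W2[:, perm], b2, X)

n_params = W1.size + b1.size + W2.size + b2.size
print(rank, rank_rescaled, rank_permuted, n_params)
```

In this toy setting each hidden neuron contributes one rescaling direction along which the outputs do not change, so the computed rank can be at most the number of parameters minus the number of hidden neurons. Changing the batch `X` changes the activation pattern and may change the rank, which is the dependence on inputs that the abstract describes.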

$\ell_1$ Regularization in Two-Layer Neural Networks

The Efficacy of Regularization in Two Layer Neural Networks

Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel

Sparse Deep Learning Models with the $\ell_1$ Regularization

How Implicit Regularization of ReLU Neural Networks Characterizes the Learned Function -- Part I: the 1-D Case of Two Layers with Random First Layer

On the Geometry of Regularization in Adversarial Training: High-Dimensional Asymptotics and Generalization Bounds

How (Implicit) Regularization of ReLU Neural Networks Characterizes the Learned Function -- Part II: the Multi-D Case of Two Layers with Random First Layer

Regularization Matters: A Nonparametric Perspective on Overparametrized Neural Network

Nonasymptotic theory for two-layer neural networks: Beyond the bias-variance trade-off

Improve Generalization and Robustness of Neural Networks via Weight Scale Shifting Invariant Regularizations

$\ell_1$-Regularized Generalized Least Squares

Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks

Geometry-induced Implicit Regularization in Deep ReLU Neural Networks

Regularization-wise double descent: Why it occurs and how to eliminate it

Implicit Regularization in Deep Learning

Regularization theory in the study of generalization ability of a biological neural network model

On the Generalization Power of Overfitted Two-Layer Neural Tangent Kernel Models

Penetrating the influence of regularizations on neural network based on information bottleneck theory

A priori generalization error for two-layer ReLU neural network through minimum norm solution

Generalization Error Analysis of Neural networks with Gradient Based Regularization

Consistency of Neural Networks with Regularization