Abstract: It is well known that neural networks with many more parameters than training examples do not overfit. Implicit regularization phenomena, which are still not well understood, occur during optimization and 'good' networks are favored. Thus the number of parameters is not an adequate measure of complexity if we do not consider all possible networks but only the 'good' ones. To better understand which networks are favored during optimization, we study the geometry of the output set as parameters vary. When the inputs are fixed, we prove that the dimension of this set changes and that the local dimension, called batch functional dimension, is almost surely determined by the activation patterns in the hidden layers. We prove that the batch functional dimension is invariant to the symmetries of the network parameterization: neuron permutations and positive rescalings. Empirically, we establish that the batch functional dimension decreases during optimization. As a consequence, optimization leads to parameters with low batch functional dimensions. We call this phenomenon geometry-induced implicit regularization.

The batch functional dimension depends on both the network parameters and inputs. To understand the impact of the inputs, we study, for fixed parameters, the largest attainable batch functional dimension when the inputs vary. We prove that this quantity, called computable full functional dimension, is also invariant to the symmetries of the network's parameterization, and is determined by the achievable activation patterns. We also provide a sampling theorem, showing a fast convergence of the estimation of the computable full functional dimension for a random input of increasing size. Empirically we find that the computable full functional dimension remains close to the number of parameters, which is related to the notion of local identifiability. This differs from the observed values for the batch functional dimension computed on training inputs and test inputs. The latter are influenced by geometry-induced implicit regularization.
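
As a rough illustration of the quantity discussed in the abstract, the sketch below treats the batch functional dimension as the rank of the Jacobian of the stacked network outputs with respect to the parameters, for a fixed batch of inputs. That reading, the one-hidden-layer architecture, the layer sizes, and the function names `batch_jacobian` and `batch_functional_dimension` are assumptions made for this example and are not taken from the abstract. The sketch also checks numerically that a positive rescaling of the hidden neurons leaves the rank unchanged.

```python
import numpy as np


def batch_jacobian(W1, b1, W2, b2, X):
    """Jacobian of the stacked outputs of f(x) = W2 relu(W1 x + b1) + b2
    with respect to all parameters, ordered as (W1, b1, W2, b2)."""
    n, d = X.shape
    h, o = W1.shape[0], W2.shape[0]
    Z = X @ W1.T + b1            # pre-activations, shape (n, h)
    A = (Z > 0).astype(float)    # activation pattern of the hidden layer
    H = Z * A                    # hidden-layer outputs
    blocks = []
    for s in range(n):
        # d f_k / d W1[j, i] = W2[k, j] * A[s, j] * X[s, i]
        dW1 = np.einsum("kj,j,i->kji", W2, A[s], X[s]).reshape(o, h * d)
        # d f_k / d b1[j] = W2[k, j] * A[s, j]
        db1 = W2 * A[s]
        # d f_k / d W2[k', j] = H[s, j] if k == k' else 0
        dW2 = np.kron(np.eye(o), H[s][None, :])
        # d f_k / d b2[k'] = 1 if k == k' else 0
        db2 = np.eye(o)
        blocks.append(np.hstack([dW1, db1, dW2, db2]))
    return np.vstack(blocks)     # shape (n * o, number of parameters)


def batch_functional_dimension(W1, b1, W2, b2, X):
    """Numerical rank of the batch Jacobian (assumed definition)."""
    return int(np.linalg.matrix_rank(batch_jacobian(W1, b1, W2, b2, X)))


# Hypothetical sizes chosen only for the illustration.
rng = np.random.default_rng(0)
d, h, o, n = 3, 4, 2, 50
W1, b1 = rng.normal(size=(h, d)), rng.normal(size=h)
W2, b2 = rng.normal(size=(o, h)), rng.normal(size=o)
X = rng.normal(size=(n, d))

rank_before = batch_functional_dimension(W1, b1, W2, b2, X)

# Positive rescaling: multiply the incoming weights and bias of each hidden
# neuron by lam > 0 and divide its outgoing weights by lam. The network
# function and the activation patterns are unchanged.
lam = rng.uniform(0.5, 2.0, size=h)
rank_after = batch_functional_dimension(
    W1 * lam[:, None], b1 * lam, W2 / lam[None, :], b2, X
)

n_params = W1.size + b1.size + W2.size + b2.size
print("parameters:", n_params)
print("rank before rescaling:", rank_before)
print("rank after rescaling:", rank_after)
```

The Jacobian is written out by hand rather than obtained by finite differences, since a numerical rank is sensitive to approximation noise. The rank is computed with NumPy's default tolerance, so it is only a numerical estimate.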

The Computational Complexity of ReLU Network Training Parameterized by Data Dimensionality

Polynomial-Time Solutions for ReLU Network Training: A Complexity Classification via Max-Cut and Zonotopes

Towards Lower Bounds on the Depth of ReLU Neural Networks

On the growth of the parameters of approximating ReLU neural networks

Neural networks with ReLU powers need less depth

On Functional Dimension and Persistent Pseudodimension

The Geometric Structure of Fully-Connected ReLU Layers

Compelling ReLU Networks to Exhibit Exponentially Many Linear Regions at Initialization and During Training

Finite-Sample Analysis of Learning High-Dimensional Single ReLU Neuron

Deep Neural Networks with ReLU-Sine-Exponential Activations Break Curse of Dimensionality in Approximation on Hölder Class

Learning Narrow One-Hidden-Layer ReLU Networks

Rates of Approximation by ReLU Shallow Neural Networks

Learning Distributions Generated by One-Layer ReLU Networks

Nonparametric regression using over-parameterized shallow ReLU neural networks

Phase Diagram for Two-layer ReLU Neural Networks at Infinite-width Limit

Geometry-induced Implicit Regularization in Deep ReLU Neural Networks

On Size-Independent Sample Complexity of ReLU Networks

Three Quantization Regimes for ReLU Networks

Lower Bounds on the Depth of Integral ReLU Neural Networks via Lattice Polytopes

Deep Neural Networks with ReLU-Sine-Exponential Activations Break Curse of Dimensionality on Hölder Class

How Implicit Regularization of ReLU Neural Networks Characterizes the Learned Function -- Part I: the 1-D Case of Two Layers with Random First Layer