Abstract: It is well known that overparametrized neural networks trained using gradient-based methods quickly achieve small training error with appropriate hyperparameter settings. Recent papers have proved this statement theoretically for highly overparametrized networks under reasonable assumptions. These results either assume that the activation function is ReLU or depend crucially on the minimum eigenvalue of a certain Gram matrix that depends on the data, the random initialization, and the activation function. In the latter case, existing works only prove that this minimum eigenvalue is non-zero and do not provide quantitative bounds. On the empirical side, a contemporary line of investigation has proposed a number of alternative activation functions that tend to perform better than ReLU, at least in some settings, but no clear understanding has emerged. This state of affairs underscores the importance of theoretically understanding the impact of activation functions on training. In the present paper, we provide theoretical results about the effect of the activation function on the training of highly overparametrized 2-layer neural networks. A crucial property that governs the performance of an activation is whether or not it is smooth. For non-smooth activations such as ReLU, SELU and ELU, all eigenvalues of the associated Gram matrix are large under minimal assumptions on the data. For smooth activations such as tanh, swish and polynomials, the situation is more complex. If the subspace spanned by the data has small dimension, then the minimum eigenvalue of the Gram matrix can be small, leading to slow training. But if the dimension is large and the data satisfies another mild condition, then the eigenvalues are large. If we allow deep networks, then the small data dimension is not a limitation, provided that the depth is sufficient. We discuss a number of extensions and applications of these results.
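To make the contrast between smooth and non-smooth activations concrete, the following is a minimal numerical sketch rather than the paper's construction. The abstract does not define the Gram matrix, so the sketch assumes the form commonly used in this line of work, G_ij = <x_i, x_j> * E_w[act'(w . x_i) act'(w . x_j)] with Gaussian w, and estimates it from m random weight vectors. The function names, the quarter-circle data set, and the sizes n, d, m are all illustrative choices.

```python
import numpy as np

def relu_grad(z):
    # Derivative of ReLU (non-smooth: it jumps at z = 0).
    return (z > 0).astype(float)

def tanh_grad(z):
    # Derivative of tanh (smooth).
    return 1.0 - np.tanh(z) ** 2

def empirical_gram(X, W, act_grad):
    """Finite-width estimate of the ASSUMED Gram matrix (not taken from the abstract).

    X: (n, d) data, W: (m, d) random first-layer weights.
    Entry (i, j) is <x_i, x_j> * mean_k act_grad(w_k . x_i) * act_grad(w_k . x_j).
    """
    D = act_grad(X @ W.T)                      # (n, m) activation derivatives
    return (X @ X.T) * (D @ D.T) / W.shape[0]  # elementwise product of two (n, n) matrices

def arc_data(n, d, rng):
    # n distinct unit vectors on a quarter-circle arc inside a random
    # 2-dimensional subspace of R^d, i.e. data spanning a low-dimensional subspace.
    theta = np.arange(n) * (np.pi / 2) / n
    plane = np.linalg.qr(rng.standard_normal((d, 2)))[0]  # (d, 2), orthonormal columns
    return np.column_stack([np.cos(theta), np.sin(theta)]) @ plane.T

rng = np.random.default_rng(0)
n, d, m = 20, 10, 5000                         # illustrative sizes
X = arc_data(n, d, rng)
W = rng.standard_normal((m, d))                # Gaussian initialization (assumption)

G_relu = empirical_gram(X, W, relu_grad)
G_tanh = empirical_gram(X, W, tanh_grad)

# eigvalsh returns eigenvalues in ascending order, so index 0 is the minimum.
relu_min = np.linalg.eigvalsh(G_relu)[0]
tanh_min = np.linalg.eigvalsh(G_tanh)[0]
print(f"min eigenvalue, ReLU: {relu_min:.3e}")
print(f"min eigenvalue, tanh: {tanh_min:.3e}")
```

On data confined to a 2-dimensional subspace, one would expect the tanh Gram matrix to be numerically close to singular while the ReLU one stays well conditioned, in line with the dichotomy the abstract describes. The sketch says nothing about the quantitative bounds proved in the paper.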

Over-parametrized neural networks as under-determined linear systems

Over-parameterized Adversarial Training: An Analysis Overcoming the Curse of Dimensionality

Towards an Understanding of Benign Overfitting in Neural Networks

Do highly over-parameterized neural networks generalize since bad solutions are rare?

Effect of Activation Functions on the Training of Overparametrized Neural Nets

A Local Convergence Theory for Mildly Over-Parameterized Two-Layer Neural Network

A Convergence Theory Towards Practical Over-parameterized Deep Neural Networks

On the Complexity of Learning Neural Networks

Nonasymptotic theory for two-layer neural networks: Beyond the bias-variance trade-off

Benign Overfitting in Deep Neural Networks under Lazy Training

Nonparametric regression using over-parameterized shallow ReLU neural networks

Harmless Overparametrization in Two-layer Neural Networks

An Improved Analysis of Training Over-parameterized Deep Neural Networks

A Dynamical View on Optimization Algorithms of Overparameterized Neural Networks

Dynamics in Deep Classifiers Trained with the Square Loss: Normalization, Low Rank, Neural Collapse, and Generalization Bounds

ReLU Neural Networks with Linear Layers are Biased Towards Single- and Multi-Index Models

On the Omnipresence of Spurious Local Minima in Certain Neural Network Training Problems

Novel Kernel Models and Exact Representor Theory for Neural Networks Beyond the Over-Parameterized Regime

Regularization Matters: A Nonparametric Perspective on Overparametrized Neural Network

On the Impact of Overparameterization on the Training of a Shallow Neural Network in High Dimensions

The Persistence of Neural Collapse Despite Low-Rank Bias: An Analytic Perspective Through Unconstrained Features