Just How Flexible are Neural Networks in Practice?

Ravid Shwartz-Ziv, Micah Goldblum, Arpit Bansal, C. Bayan Bruss, Yann LeCun, Andrew Gordon Wilson

2024-06-17

Abstract: It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters, underpinning notions of overparameterized and underparameterized models. In practice, however, we only find solutions accessible via our training procedure, including the optimizer and regularizers, limiting flexibility. Moreover, the exact parameterization of the function class, built into an architecture, shapes its loss surface and impacts the minima we find. In this work, we examine the ability of neural networks to fit data in practice. Our findings indicate that: (1) standard optimizers find minima where the model can only fit training sets with significantly fewer samples than it has parameters; (2) convolutional networks are more parameter-efficient than MLPs and ViTs, even on randomly labeled data; (3) while stochastic training is thought to have a regularizing effect, SGD actually finds minima that fit more training data than full-batch gradient descent; (4) the difference in capacity to fit correctly labeled and incorrectly labeled samples can be predictive of generalization; (5) ReLU activation functions result in finding minima that fit more data despite being designed to avoid vanishing and exploding gradients in deep architectures.

Machine Learning

What problem does this paper attempt to address?

This paper examines how flexible neural networks are in practice, that is, how much training data they can actually fit. The conventional view is that a neural network can fit at least as many training samples as it has parameters, but the training procedure, including the optimizer and regularizers, limits which solutions are actually reached. The study reports five findings:

1. The minima found by standard optimizers fit training sets with significantly fewer samples than the model has parameters.
2. Convolutional neural networks (CNNs) are more parameter-efficient than multilayer perceptrons (MLPs) and Vision Transformers (ViTs), even on randomly labeled data.
3. Although stochastic training is thought to have a regularizing effect, stochastic gradient descent (SGD) finds minima that fit more training data than full-batch gradient descent does.
4. The difference between a model's capacity to fit correctly labeled samples and its capacity to fit incorrectly labeled samples can be predictive of generalization.
5. ReLU activation functions lead to minima that fit more data, even though they were designed to avoid vanishing and exploding gradients in deep architectures.

The paper empirically measures how many training samples a model can fit and analyzes how data types, model architectures, and optimizers affect that number. These findings call into question the parameter-count-based notions of overparameterized and underparameterized models and characterize the flexibility of neural networks in practical applications.
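One way to make "the number of training samples a model can fit" concrete is to grow the training set until the training procedure first fails to reach 100% training accuracy. The sketch below is only an illustration of that measurement, not the paper's protocol: it assumes a linear classifier with `d` parameters, random labels, and a closed-form least-squares fit standing in for an optimizer. The names `fits_perfectly` and `largest_fittable_n` are hypothetical.

```python
import numpy as np


def fits_perfectly(X, y):
    """Fit a linear classifier by least squares (an assumed stand-in for an
    optimizer) and report whether every training label is matched."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return bool(np.all(np.sign(X @ w) == y))


def largest_fittable_n(X, y, sizes):
    """Return the largest n in the increasing sequence `sizes` for which the
    first n samples are fitted perfectly, stopping at the first failure."""
    best = 0
    for n in sizes:
        if not fits_perfectly(X[:n], y[:n]):
            break
        best = n
    return best


rng = np.random.default_rng(0)
d = 10                                   # number of parameters of the toy model
X = rng.standard_normal((200, d))        # random inputs
y = rng.choice([-1.0, 1.0], size=200)    # random labels, so there is no signal to learn

n_max = largest_fittable_n(X, y, range(1, 60))
print(f"parameters: {d}, largest training set fitted: {n_max}")
```

The result depends on the fitting procedure as well as on the model class: swapping least squares for a different training routine in `fits_perfectly` can change `n_max` for the same `d`. This toy linear model says nothing about the neural-network results summarized above; it only shows how the quantity can be measured.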

A Convergence Theory Towards Practical Over-parameterized Deep Neural Networks

Over-parametrized neural networks as under-determined linear systems

On the Omnipresence of Spurious Local Minima in Certain Neural Network Training Problems

Benign Overfitting in Deep Neural Networks under Lazy Training

Consistency of Neural Networks with Regularization

Growing Tiny Networks: Spotting Expressivity Bottlenecks and Fixing Them Optimally

Visualizing the Loss Landscape of Neural Nets

Effect of Activation Functions on the Training of Overparametrized Neural Nets

Implicit Compressibility of Overparametrized Neural Networks Trained with Heavy-Tailed SGD

How Uniform Random Weights Induce Non-uniform Bias: Typical Interpolating Neural Networks Generalize with Narrow Teachers

On the ISS Property of the Gradient Flow for Single Hidden-Layer Neural Networks with Linear Activations

Does a sparse ReLU network training problem always admit an optimum?

Towards an Understanding of Benign Overfitting in Neural Networks

On the Complexity of Learning Neural Networks

Convergence of Adversarial Training in Overparametrized Neural Networks

Self-Expanding Neural Networks

Do highly over-parameterized neural networks generalize since bad solutions are rare?

Convergence of Adversarial Training in Overparametrized Networks

Over-parameterization and Adversarial Robustness in Neural Networks: An Overview and Empirical Analysis

A Flexible Selection Scheme for Minimum-Effort Transfer Learning