Just How Flexible are Neural Networks in Practice?

Ravid Shwartz-Ziv, Micah Goldblum, Arpit Bansal, C. Bayan Bruss, Yann LeCun, Andrew Gordon Wilson
2024-06-17
Abstract: It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters, underpinning notions of overparameterized and underparameterized models. In practice, however, we only find solutions accessible via our training procedure, including the optimizer and regularizers, limiting flexibility. Moreover, the exact parameterization of the function class, built into an architecture, shapes its loss surface and impacts the minima we find. In this work, we examine the ability of neural networks to fit data in practice. Our findings indicate that: (1) standard optimizers find minima where the model can only fit training sets with significantly fewer samples than it has parameters; (2) convolutional networks are more parameter-efficient than MLPs and ViTs, even on randomly labeled data; (3) while stochastic training is thought to have a regularizing effect, SGD actually finds minima that fit more training data than full-batch gradient descent; (4) the difference in capacity to fit correctly labeled and incorrectly labeled samples can be predictive of generalization; (5) ReLU activation functions result in finding minima that fit more data despite being designed to avoid vanishing and exploding gradients in deep architectures.
Machine Learning
What problem does this paper attempt to address?
This paper examines how flexible neural networks are in practice. The conventional view is that a neural network can fit a training set with at least as many samples as it has parameters, but in reality the training procedure (including the optimizer and regularizers) limits which solutions are reachable, and the architecture's parameterization shapes the loss surface and the minima found. The study reports five findings:
1. The minima found by standard optimizers fit training sets with significantly fewer samples than the model has parameters.
2. Convolutional neural networks (CNNs) are more parameter-efficient than multilayer perceptrons (MLPs) and Vision Transformers (ViTs), even on randomly labeled data.
3. Although stochastic training is thought to have a regularizing effect, stochastic gradient descent (SGD) finds minima that fit more training data than full-batch gradient descent does.
4. The difference between a model's capacity to fit correctly labeled samples and its capacity to fit incorrectly labeled samples can be predictive of generalization.
5. ReLU activation functions lead to minima that fit more data, even though they were designed to avoid vanishing and exploding gradients in deep architectures.
The paper empirically measures how many training samples a model can fit and analyzes how data type, model architecture, and optimizer affect that number. These findings challenge the conventional notions of overparameterized and underparameterized models and characterize the flexibility neural networks actually exhibit in practice.
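To make "the number of training samples a model can fit" concrete, here is a minimal sketch in standard-library Python. It is an illustration under stated assumptions, not the paper's procedure: the model is a toy linear classifier with d parameters, the inputs are Gaussian, the labels are random, and the helper names `fits_perfectly` and `largest_fittable_n` are hypothetical. It trains with full-batch gradient descent and reports the largest candidate sample count that reaches 100% training accuracy within a step budget.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fits_perfectly(n, d, seed=0, steps=2000, lr=1.0):
    """Train a d-parameter linear classifier on n randomly labeled points.

    Returns True if 100% training accuracy is reached within `steps`
    full-batch gradient descent updates (accuracy is checked before
    each update), else False. Toy setup, assumed for illustration only.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d)  # keeps input rows at roughly unit norm
    X = [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(n)]
    y = [rng.randint(0, 1) for _ in range(n)]  # random binary labels
    w = [0.0] * d
    for _ in range(steps):
        logits = [sum(wj * xj for wj, xj in zip(w, row)) for row in X]
        if all((z > 0) == (label == 1) for z, label in zip(logits, y)):
            return True
        # Gradient of the mean logistic loss with respect to w.
        errors = [sigmoid(z) - label for z, label in zip(logits, y)]
        for j in range(d):
            grad_j = sum(e * row[j] for e, row in zip(errors, X)) / n
            w[j] -= lr * grad_j
    return False


def largest_fittable_n(d, candidates, seed=0, steps=2000):
    """Largest candidate n that the toy model fits perfectly (0 if none)."""
    best = 0
    for n in sorted(candidates):
        if fits_perfectly(n, d, seed=seed, steps=steps):
            best = n
    return best


if __name__ == "__main__":
    print(largest_fittable_n(d=20, candidates=[3, 10, 300], steps=200))
```

The result depends on the optimizer, the step budget, and the learning rate as well as on the parameter count, which is the sense in which the training procedure, and not only the model size, determines how much data is fit in practice.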