Abstract: The neural network memorization problem is to study the expressive power of neural networks to interpolate a finite dataset. Although memorization is widely believed to have a close relationship with the strong generalizability of deep learning when using over-parameterized models, to the best of our knowledge, there exists no theoretical study on the generalizability of memorization neural networks. In this paper, we give the first theoretical analysis of this topic. Since using i.i.d. training data is a necessary condition for a learning algorithm to be generalizable, memorization and its generalization theory for i.i.d. datasets are developed under mild conditions on the data distribution. First, algorithms are given to construct memorization networks for an i.i.d. dataset, which have the smallest number of parameters and even a constant number of parameters. Second, we show that, in order for the memorization networks to be generalizable, the width of the network must be at least equal to the dimension of the data, which implies that the existing memorization networks with an optimal number of parameters are not generalizable. Third, a lower bound for the sample complexity of general memorization algorithms and the exact sample complexity for memorization algorithms with constant number of parameters are given. It is also shown that there exist data distributions such that, to be generalizable for them, the memorization network must have an exponential number of parameters in the data dimension. Finally, an efficient and generalizable memorization algorithm is given when the number of training samples is greater than the efficient memorization sample complexity of the data distribution.

What problem does this paper attempt to address?

This paper addresses the relationship between memorization and generalizability in neural networks. Although memorization is widely believed to be closely related to the strong generalizability of over-parameterized deep learning models, no prior theoretical work had studied the generalizability of memorization networks. The paper gives the first theoretical analysis of this topic, examining how the structure and sample complexity of memorization networks affect generalizability for independent and identically distributed (i.i.d.) datasets. Throughout, \( N \) is the number of training samples and \( n \) is the data dimension.

### Main problems

1. **Parameter complexity of memorization networks**:
   - Study how to construct a memorization network with the minimum number of parameters.
   - Prove that for i.i.d. datasets there exists an algorithm that constructs a memorization network of width 6 and depth \( O(\sqrt{N} \ln(Nn/c)) \).
2. **Generalization conditions of memorization networks**:
   - Give necessary conditions for a memorization network to be generalizable, in particular that the network width must be at least the dimension of the data.
   - Prove that a memorization network with a fixed width cannot generalize under certain data distributions.
3. **Sample complexity of memorization networks**:
   - Give lower and upper bounds on the sample complexity of memorization algorithms.
   - Prove that for certain data distributions, a generalizable memorization network must have a number of parameters that is exponential in the data dimension.
4. **Existence of efficient memorization algorithms**:
   - Explore whether there exists a polynomial-time memorization algorithm that is generalizable, and give its sample complexity.
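To get a feel for the width-6 construction mentioned above, the following minimal sketch evaluates the stated depth bound \( \sqrt{N} \ln(Nn/c) \) and counts the parameters of a fully connected network of that shape. Several things here are assumptions made only for illustration: the constant hidden by the \( O(\cdot) \) notation is set to 1 (`const`), \( c \) is treated simply as a positive constant, and the helper names `depth_bound` and `mlp_param_count` are hypothetical rather than taken from the paper.

```python
import math


def depth_bound(N, n, c, const=1.0):
    """Evaluate const * sqrt(N) * ln(N * n / c), rounded up to an integer depth.

    `const` stands in for the unspecified constant hidden by the O(.) notation.
    """
    if N < 1 or n < 1 or c <= 0:
        raise ValueError("expected N >= 1, n >= 1 and c > 0")
    return max(1, math.ceil(const * math.sqrt(N) * math.log(N * n / c)))


def mlp_param_count(n, width, depth):
    """Weights plus biases of a fully connected network with `depth` hidden
    layers of equal `width`, input dimension n and a scalar output."""
    first = n * width + width                       # input -> first hidden layer
    hidden = (depth - 1) * (width * width + width)  # hidden -> hidden layers
    output = width + 1                              # last hidden layer -> output
    return first + hidden + output


# Quadrupling N roughly doubles the depth (up to the logarithmic factor).
for N in (100, 400, 1600):
    d = depth_bound(N, n=10, c=0.5)
    print(N, d, mlp_param_count(n=10, width=6, depth=d))
```

Because the width is fixed, the parameter count grows linearly with the depth, so under these assumptions it scales like \( \sqrt{N} \) up to logarithmic factors.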
### Specific problems

- **Memorization network with the minimum number of parameters**:
  - Define the memorization parameter complexity \( N_D \), and prove that for any \( D \sim D(n, c) \) there exists a constant \( N_D \) such that almost all \( D_{tr} \sim D^N \) can be memorized by a network with no more than \( N_D \) parameters.
- **Necessary conditions for generalization**:
  - Prove that for a set \( H \) of neural networks of width \( w \), there exist a data distribution \( D \) and an integer \( n > w \) such that no memorization network of \( D_{tr} \) in \( H \) is generalizable.
  - Prove that for almost all data distributions \( D \), there exists a memorization network with \( O(\sqrt{N}) \) parameters that is not generalizable.
- **Lower and upper bounds of sample complexity**:
  - Prove that for a memorization algorithm to be generalizable, the number of samples must satisfy \( N = \Omega(N_D^2 / \ln^2(N_D)) \).
  - Prove that for a memorization network with no more than \( N_D \) parameters, if \( N = O(N_D^2 \ln(N_D)) \), then the network is generalizable.
- **Efficient memorization algorithms**:
  - Prove that there exists a constant \( S_D \) depending on \( D \) such that, when the number of samples \( N \) exceeds \( S_D \), a generalizable memorization network with \( O(N^2 n) \) parameters can be constructed in polynomial time.

Together, these results provide a theoretical basis for understanding the generalizability of memorization networks and point to a possible path for constructing efficient and generalizable memorization algorithms.
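As a concrete illustration of what it means to memorize a finite dataset, the sketch below builds a one-hidden-layer ReLU network that interpolates \( N \) labelled points exactly. This is a classical textbook construction (project onto a random direction, then fit a piecewise-linear function through the sorted projections), not the algorithm from the paper. The function name `build_memorizer`, the uniform data, and the labelling rule are all assumptions chosen for the example.

```python
import numpy as np


def build_memorizer(X, y, seed=0):
    """Return a ReLU network (as a Python callable) that interpolates (X, y).

    The data is projected onto a random direction u, and a piecewise-linear
    function with one ReLU unit per knot is fitted through the projections.
    """
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(X.shape[1])
    t = X @ u
    order = np.argsort(t)
    t_sorted, y_sorted = t[order], y[order].astype(float)
    if np.any(np.diff(t_sorted) == 0):
        raise ValueError("projections collide; try another seed")
    slopes = np.diff(y_sorted) / np.diff(t_sorted)  # slope on each interval
    coeffs = np.diff(slopes, prepend=0.0)           # slope change at each knot
    knots = t_sorted[:-1]
    bias = y_sorted[0]

    def net(Z):
        # Hidden layer: relu(u . z - knot_k), one unit per knot (width N - 1).
        hidden = np.maximum((Z @ u)[:, None] - knots[None, :], 0.0)
        return hidden @ coeffs + bias

    return net


# Hypothetical data: uniform points labelled by the sign of x_0 - 0.5.
rng = np.random.default_rng(1)
n, N = 10, 50
X_train = rng.uniform(size=(N, n))
y_train = np.where(X_train[:, 0] > 0.5, 1.0, -1.0)
net = build_memorizer(X_train, y_train)

X_test = rng.uniform(size=(2000, n))
y_test = np.where(X_test[:, 0] > 0.5, 1.0, -1.0)
train_acc = np.mean(np.sign(net(X_train)) == y_train)
test_acc = np.mean(np.sign(net(X_test)) == y_test)
print("train accuracy:", train_acc, "test accuracy:", test_acc)
```

The network fits the training labels by construction, while its accuracy on fresh samples depends on how well a single random projection captures the labelling rule. Comparing the two printed numbers gives an informal sense of the gap between memorization and generalization that the paper studies.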

Generalizability of Memorization Neural Networks
