Abstract: To investigate the theoretical foundations of deep learning from the viewpoint of the minimum description length (MDL) principle, we analyse risk bounds of MDL estimators based on two-stage codes for simple two-layer neural networks (NNs) with ReLU activation. For that purpose, we propose a method to design two-stage codes for linear regression models and establish an upper bound on the risk of the corresponding MDL estimators, based on the theory of MDL estimators originated by Barron and Cover (1991). Then, we apply this result to simple two-layer NNs with ReLU activation, which consist of $d$ nodes in the input layer, $m$ nodes in the hidden layer and one output node. Since the object of estimation is only the $m$ weights from the hidden layer to the output node in our setting, this is an example of a linear regression model. As a result, we show that the redundancy of the obtained two-stage codes is small owing to the fact that the eigenvalue distribution of the Fisher information matrix of the NNs is strongly biased, which was recently shown by Takeishi et al. (2023). That is, we establish a tight upper bound on the risk of our MDL estimators. Note that our risk bound, of which the leading term is $O(d^2 \log n /n)$, is independent of the number of parameters $m$.

What problem does this paper attempt to address?

The paper analyzes risk bounds for linear regression models and simple ReLU neural networks through the Minimum Description Length (MDL) principle. Specifically, its objectives are:

1. **Design a two-stage code**: For linear regression models, propose a method for designing two-stage codes and establish an upper bound on the risk of the corresponding MDL estimator.
2. **Apply it to ReLU neural networks**: Apply this method to a two-layer neural network with ReLU activation that has $d$ nodes in the input layer, $m$ nodes in the hidden layer and one node in the output layer.
3. **Risk bound analysis**: By analyzing the eigenvalue distribution of the Fisher information matrix, show that the designed two-stage codes have small redundancy for these networks, which yields a tight upper bound on the risk.
4. **Independence from the number of parameters**: Show that the leading term of the resulting risk bound is $O\left(\frac{d^{2}\log n}{n}\right)$ and does not depend on the number of parameters $m$.

### Background and Motivation

- **Success and challenges of deep learning**: Although deep learning has achieved remarkable success in multiple fields, its theoretical guarantees remain insufficient, which raises concerns about model reliability and generalization ability.
- **Application of the MDL principle**: Under the MDL principle, learning is cast as finding the shortest description of the data. The work of Barron and Cover (1991) provides the mathematical basis for risk bounds on MDL estimators.
- **Reduction to a linear regression model**: The paper estimates only the weights from the hidden layer to the output node of a two-layer ReLU neural network. With the remaining weights held fixed, this is a linear regression problem.
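
The reduction to linear regression and the two-stage MDL estimator can be written out as follows. This is only a sketch in notation chosen here for illustration: the symbols $w_j$, $v$, $\phi$, $\sigma^2$, $\Theta_n$ and $L_n$, as well as the Gaussian noise model, are assumptions and are not taken from the abstract. With the hidden-layer weights $w_1, \dots, w_m \in \mathbb{R}^d$ held fixed, the network output is linear in the output weights $v \in \mathbb{R}^m$:

$$
f_v(x) = \sum_{j=1}^{m} v_j \max(0, w_j^\top x) = v^\top \phi(x), \qquad y = f_v(x) + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2).
$$

Under this assumed noise model, the Fisher information matrix with respect to $v$ is the $m \times m$ matrix

$$
J = \frac{1}{\sigma^2}\, \mathbb{E}\left[\phi(x)\,\phi(x)^\top\right].
$$

A two-stage MDL estimator in the sense of Barron and Cover (1991) first describes a parameter from a countable (quantized) set $\Theta_n$ with a codelength $L_n$ that satisfies Kraft's inequality, and then describes the data given that parameter:

$$
\hat{v} = \operatorname*{arg\,min}_{v \in \Theta_n} \left\{ -\log p_v(y^n \mid x^n) + L_n(v) \right\}, \qquad \sum_{v \in \Theta_n} e^{-L_n(v)} \le 1.
$$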

### Main Contributions

- **Design of two-stage codes**: Exploit the strong skew of the eigenvalue distribution of the Fisher information matrix to design efficient two-stage codes.
- **Specific form of the risk bounds**: For the linear regression model, give an explicit upper bound on the risk and apply it to the ReLU neural network to obtain a tight upper bound.
- **Combination of theory and practice**: Through theoretical analysis and experimental verification, demonstrate the effectiveness of the proposed MDL estimator in practical applications.

### Key Technologies

- **Approximate spectral decomposition of the Fisher information matrix**: Use the result of Takeishi et al. (2023), an approximate spectral decomposition of the Fisher information matrix, which reveals the strong skew of its eigenvalue distribution.
- **Quantization strategy for the two-stage code**: Quantize the parameter space according to the magnitude of the eigenvalues, so that the resulting code is efficient.

### Conclusion

The paper establishes risk bounds for linear regression models and simple ReLU neural networks through the MDL principle and shows that the leading term of these bounds does not depend on the number of parameters $m$, thus providing a new perspective for the theoretical analysis of deep learning models.
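
The skew of the Fisher information spectrum can be illustrated numerically. The sketch below is not the paper's construction: it assumes standard Gaussian inputs, Gaussian hidden-layer weights that are held fixed, and unit noise variance, and the function names `relu_feature_fisher` and `top_k_trace_fraction` are hypothetical. It estimates $J = \mathbb{E}[\phi(x)\phi(x)^\top]/\sigma^2$ from samples and reports how much of the trace the largest $1 + d + d(d+1)/2$ eigenvalues carry, a count that depends on $d$ but not on $m$.

```python
import numpy as np


def relu_feature_fisher(d=5, m=200, n=5000, sigma2=1.0, seed=0):
    """Empirical Fisher information w.r.t. the m output weights.

    Assumptions (not from the paper): x ~ N(0, I_d), hidden weights
    w_j ~ N(0, I_d) drawn once and held fixed, Gaussian noise variance sigma2.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((m, d))   # fixed hidden-layer weights
    X = rng.standard_normal((n, d))   # sampled inputs
    Phi = np.maximum(X @ W.T, 0.0)    # n x m matrix of ReLU features
    return Phi.T @ Phi / (n * sigma2)  # m x m sample average of phi phi^T


def top_k_trace_fraction(J, k):
    """Fraction of trace(J) carried by the k largest eigenvalues."""
    eig = np.linalg.eigvalsh(J)[::-1]  # eigenvalues in descending order
    return eig[:k].sum() / eig.sum()


d, m = 5, 200
J = relu_feature_fisher(d=d, m=m)
k = 1 + d + d * (d + 1) // 2           # depends on d only, not on m
frac = top_k_trace_fraction(J, k)
print(f"top {k} of {m} eigenvalues carry {frac:.1%} of the trace")
```

Increasing `m` while keeping `d` fixed is a simple way to check that the number of dominant eigenvalues does not grow with the number of parameters, which is the property the quantization strategy relies on.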

Risk Bounds on MDL Estimators for Linear Regression Models with Application to Simple ReLU Neural Networks

Robust nonparametric regression based on deep ReLU neural networks

Minimum Description Length Principle in Supervised Learning with Application to Lasso

Improved MDL Estimators Using Fiber Bundle of Local Exponential Families for Non-exponential Families

Finite-Sample Analysis of Learning High-Dimensional Single ReLU Neuron

Neural Networks Generalize on Low Complexity Data

Nonparametric regression using over-parameterized shallow ReLU neural networks

Near-Minimax Optimal Estimation With Shallow ReLU Neural Networks

Classification with Deep Neural Networks and Logistic Loss

A priori generalization error for two-layer ReLU neural network through minimum norm solution

Minimax optimality of deep neural networks on dependent data via PAC-Bayes bounds

Nonparametric regression using deep neural networks with ReLU activation function

Stable Minima Cannot Overfit in Univariate ReLU Networks: Generalization by Large Step Sizes

High-dimensional Penalty Selection via Minimum Description Length Principle

Deep Neural Networks with ReLU-Sine-Exponential Activations Break Curse of Dimensionality in Approximation on Hölder Class

Bridging the Gap Between Approximation and Learning via Optimal Approximation by ReLU MLPs of Maximal Regularity

Nonparametric logistic regression with deep learning

Rates of Approximation by ReLU Shallow Neural Networks

Convergence Analysis of Two-layer Neural Networks with ReLU Activation

Tight Certified Robustness via Min-Max Representations of ReLU Neural Networks

Low dimensional approximation and generalization of multivariate functions on smooth manifolds using deep ReLU neural networks