Risk Bounds on MDL Estimators for Linear Regression Models with Application to Simple ReLU Neural Networks

Yoshinari Takeishi, Jun'ichi Takeuchi
2024-07-04
Abstract: To investigate the theoretical foundations of deep learning from the viewpoint of the minimum description length (MDL) principle, we analyse risk bounds of MDL estimators based on two-stage codes for simple two-layer neural networks (NNs) with ReLU activation. For that purpose, we propose a method to design two-stage codes for linear regression models and establish an upper bound on the risk of the corresponding MDL estimators based on the theory of MDL estimators originated by Barron and Cover (1991). Then, we apply this result to the simple two-layer NNs with ReLU activation, which consist of $d$ nodes in the input layer, $m$ nodes in the hidden layer and one output node. Since the object of estimation is only the $m$ weights from the hidden layer to the output node in our setting, this is an example of linear regression models. As a result, we show that the redundancy of the obtained two-stage codes is small owing to the fact that the eigenvalue distribution of the Fisher information matrix of the NNs is strongly biased, which was recently shown by Takeishi et al. (2023). That is, we establish a tight upper bound on the risk of our MDL estimators. Note that our risk bound, of which the leading term is $O(d^2 \log n /n)$, is independent of the number of parameters $m$.
Information Theory
What problem does this paper attempt to address?
The paper analyzes risk bounds for linear regression models and simple ReLU neural networks using the minimum description length (MDL) principle. Specifically, its objectives are:

1. **Design two-stage codes**: For linear regression models, propose a method for designing two-stage codes and establish an upper bound on the risk of the corresponding MDL estimators.
2. **Apply to ReLU neural networks**: Apply this method to two-layer neural networks with ReLU activation that have \(d\) nodes in the input layer, \(m\) nodes in the hidden layer, and one output node.
3. **Risk bound analysis**: Use the eigenvalue distribution of the Fisher information matrix to show that the resulting two-stage codes have small redundancy for these networks, which yields a tight upper bound on the risk.
4. **Independence from the number of parameters**: Show that the leading term of the risk bound is \(O\left(\frac{d^{2}\log n}{n}\right)\) and does not depend on the number of parameters \(m\).

### Background and Motivation

- **Success and challenges of deep learning**: Deep learning has achieved remarkable success in many fields, but its theoretical guarantees remain insufficient, which raises concerns about model reliability and generalization.
- **Application of the MDL principle**: The MDL principle selects the model that gives the shortest description of the data, which links data compression to learning. The work of Barron and Cover (1991) provides the mathematical basis for the risk analysis of MDL estimators.
- **Simplification to a linear regression model**: The paper estimates only the \(m\) weights from the hidden layer to the output node of a two-layer ReLU network, which makes the problem an instance of linear regression.
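The reduction to linear regression described above can be illustrated with a minimal sketch. It is not the paper's code: the standard Gaussian inputs, the fixed random hidden weights, the helper names `relu_features` and `network`, and the cutoff `k` are all assumptions made for illustration. With the hidden weights held fixed, the network output is linear in the output weights, and the Fisher information matrix (up to the noise-variance factor) is the second-moment matrix of the hidden-layer features, whose eigenvalues can then be inspected directly.

```python
import numpy as np

def relu_features(X, W):
    """Hidden-layer outputs ReLU(X W^T) for fixed hidden weights W (m x d)."""
    return np.maximum(X @ W.T, 0.0)

def network(X, W, v):
    """Two-layer ReLU network output; linear in the output weights v."""
    return relu_features(X, W) @ v

rng = np.random.default_rng(0)
n, d, m = 2000, 5, 200
X = rng.standard_normal((n, d))   # assumed: standard Gaussian inputs
W = rng.standard_normal((m, d))   # assumed: fixed random hidden weights

# Design matrix of the equivalent linear regression problem (n x m).
Phi = relu_features(X, W)

# Empirical Fisher information for Gaussian noise, up to the 1/sigma^2 factor.
J = Phi.T @ Phi / n
eigvals = np.sort(np.linalg.eigvalsh(J))[::-1]   # descending order

# Share of the trace carried by the k leading eigenvalues.
# k = 1 + d + d(d+1)/2 is an illustrative cutoff chosen for this sketch.
k = 1 + d + d * (d + 1) // 2
top_share = eigvals[:k].sum() / eigvals.sum()
print(f"top {k} of {m} eigenvalues carry {top_share:.1%} of the trace")
```

Running the sketch with different values of `m` gives a quick way to see whether the bulk of the spectrum stays concentrated in a number of directions that depends on `d` rather than on `m`.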
### Main Contributions

- **Design of two-stage codes**: Exploit the strongly biased eigenvalue distribution of the Fisher information matrix to design efficient two-stage codes.
- **Specific form of risk bounds**: Give an explicit upper bound on the risk for linear regression models and apply it to the ReLU neural network to obtain a tight risk bound.
- **Combination of theory and practice**: Through theoretical analysis and experimental verification, demonstrate the effectiveness of the proposed MDL estimators in practical applications.

### Key Technologies

- **Approximate spectral decomposition of the Fisher information matrix**: Build on the work of Takeishi et al. (2023), whose approximate spectral decomposition of the Fisher information matrix reveals that its eigenvalue distribution is strongly biased.
- **Quantization strategy for two-stage codes**: Quantize the parameter space according to the magnitudes of the eigenvalues to obtain an efficient code.

### Conclusion

The paper establishes risk bounds for linear regression models and simple ReLU neural networks through the MDL principle and shows that the leading term of these bounds does not depend on the number of parameters \(m\), providing a new perspective for the theoretical analysis of deep learning models.
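The quantization idea listed under Key Technologies can be illustrated with a toy sketch. This is not the code design from the paper; it is a simplified construction under stated assumptions: each coordinate in the eigenbasis of the empirical Fisher information is restricted to a hypothetical box `[-radius, radius]`, the grid width in a direction with eigenvalue `lam` is taken proportional to `sigma / sqrt(n * lam)`, and the first stage uses a uniform code over the resulting grid. Directions with very small eigenvalues then receive a single grid point and cost no code length. The function name `two_stage_quantize` and all parameter values are made up for illustration.

```python
import numpy as np

def two_stage_quantize(Phi, y, sigma=1.0, radius=1.0):
    """Toy two-stage code: quantize least squares in the Fisher eigenbasis.

    Returns the quantized weights, the first-stage code length in nats
    (uniform code over the grid), and the grid size per eigen-direction.
    """
    n, m = Phi.shape
    J = Phi.T @ Phi / n                      # empirical Fisher information (up to 1/sigma^2)
    lam, U = np.linalg.eigh(J)
    lam = np.clip(lam, 0.0, None)            # remove tiny negative round-off
    # Fine grid where the information is large, coarse where it is small.
    # Capping at 2 * radius leaves a single point (zero) in flat directions.
    width = np.minimum(2.0 * radius, sigma / np.sqrt(n * lam + 1e-12))
    kmax = np.floor(radius / width)          # grid is {-kmax, ..., kmax} * width
    theta_ls, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    z_ls = U.T @ theta_ls                    # least-squares solution in eigen-coordinates
    # The squared error is separable in this basis, so coordinate-wise
    # rounding (clipped to the box) is the best grid point.
    z_q = width * np.clip(np.round(z_ls / width), -kmax, kmax)
    points = 2.0 * kmax + 1.0
    codelength = float(np.sum(np.log(points)))
    return U @ z_q, codelength, points

rng = np.random.default_rng(0)
n, d, m = 2000, 5, 200
X = rng.standard_normal((n, d))              # assumed: standard Gaussian inputs
W = rng.standard_normal((m, d))              # assumed: fixed random hidden weights
Phi = np.maximum(X @ W.T, 0.0)
v_true = rng.standard_normal(m) / np.sqrt(m)
y = Phi @ v_true + 0.1 * rng.standard_normal(n)

v_q, nats, points = two_stage_quantize(Phi, y, sigma=0.1, radius=1.0)
v_ls, *_ = np.linalg.lstsq(Phi, y, rcond=None)
mse_q = float(np.mean((y - Phi @ v_q) ** 2))
mse_ls = float(np.mean((y - Phi @ v_ls) ** 2))
naive_nats = m * float(np.log(points.max()))  # finest grid used in every direction
print(f"code length: {nats:.1f} nats (naive uniform grid: {naive_nats:.1f} nats)")
print(f"training MSE: quantized {mse_q:.4f}, least squares {mse_ls:.4f}")
```

Because the code is uniform over the grid, minimizing the negative log-likelihood plus the code length reduces here to finding the best grid point, which is why plain rounding suffices. A non-uniform first-stage code would require a genuine trade-off between fit and code length.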