Abstract: We show that deep neural networks achieve dimension-independent rates of convergence for learning structured densities such as those arising in image, audio, video, and text applications. More precisely, we demonstrate that neural networks with a simple $L^2$-minimizing loss achieve a rate of $n^{-1/(4+r)}$ in nonparametric density estimation when the underlying density is Markov to a graph whose maximum clique size is at most $r$, and we provide evidence that in the aforementioned applications, this size is typically constant, i.e., $r=O(1)$. We then establish that the optimal rate in $L^1$ is $n^{-1/(2+r)}$ which, compared to the standard nonparametric rate of $n^{-1/(2+d)}$, reveals that the effective dimension of such problems is the size of the largest clique in the Markov random field. These rates are independent of the data's ambient dimension, making them applicable to realistic models of image, sound, video, and text data. Our results provide a novel justification for deep learning's ability to circumvent the curse of dimensionality, demonstrating dimension-independent convergence rates in these contexts.

### What problem does this paper attempt to solve?

This paper studies the performance of deep neural networks in structured density estimation, and in particular how they achieve dimension-independent convergence rates on high-dimensional data (such as images, audio, video, and text). Specifically, the authors show that when the underlying density is Markov to a graph with small maximum clique size, deep neural networks achieve a convergence rate that is independent of the ambient dimension of the data. The main objectives and contributions of the paper are:

1. **Address the curse of dimensionality in high-dimensional density estimation**
   - Traditional nonparametric density estimation methods face the "curse of dimensionality" in high-dimensional spaces: the sample complexity grows exponentially with the dimension. Deep learning methods nonetheless perform well on high-dimensional data and can learn complex density functions from relatively few data points.

2. **Introduce the Markov random field (MRF) hypothesis**
   - The authors propose that for some types of data (such as images and audio), the density can be described by a Markov random field. The dependence structure of the data is then represented by a graph in which each node is a variable, edges represent direct dependencies, and missing edges encode conditional independence.

3. **Prove the convergence rate of neural networks under the MRF hypothesis**
   - The authors prove that under the MRF hypothesis, a neural network trained with a simple $L^2$-minimizing loss achieves a convergence rate of $n^{-1/(4 + r)}$, which depends on the size $r$ of the maximum clique rather than on the ambient dimension $d$ of the data. In these applications the effective dimension is therefore the maximum clique size $r$, which is typically constant, i.e., $r = O(1)$.

4. **Provide a theoretical basis**
   - The paper gives a theoretical explanation of why deep learning can bypass the curse of dimensionality in high-dimensional density estimation. By exploiting the MRF structure, neural networks can capture local dependence relationships and ignore correlations between distant variables, which yields a dimension-independent convergence rate.

5. **Experimental verification**
   - The authors provide experimental evidence for the MRF hypothesis, especially for image data, showing how conditional independence between pixels supports the MRF model.

### Formula summary

- **Convergence rate with the $L^2$-minimizing loss**:
  \[ \| p - \hat{p}_n \|_1 \in \tilde{O}\left(n^{-1/(4 + r)}\right) \]
  where $r$ is the size of the maximum clique.

- **Optimal convergence rate in $L^1$**:
  \[ \| p - \hat{p}_n \|_1 \in \tilde{O}\left(n^{-1/(2 + r)}\right) \]

- **Standard nonparametric convergence rate**:
  \[ \| p - \hat{p}_n \|_1 \in O\left(n^{-1/(2 + d)}\right) \]
  where $d$ is the ambient dimension of the data.

Through these results, the paper provides a new theoretical explanation for the success of deep learning in high-dimensional density estimation and shows that the MRF structure is a useful model in practical applications.
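To make the gap between these rates concrete, the following is a minimal Python sketch, not taken from the paper. It makes two illustrative assumptions: a 4-neighbour pixel grid as the MRF graph (the paper's graphs for real data may differ), and rates read as exact equalities with constants and any logarithmic factors ignored, so that a target error $\varepsilon$ needs roughly $n = \varepsilon^{-k}$ samples for a rate $n^{-1/k}$. The helper names `grid_edges`, `max_clique_size`, and `samples_needed` are hypothetical.

```python
import math
from itertools import combinations


def grid_edges(rows, cols):
    """Edges of a 4-neighbour grid graph (an assumed toy MRF over pixels)."""
    edges = set()
    for i in range(rows):
        for j in range(cols):
            if i + 1 < rows:
                edges.add(frozenset({(i, j), (i + 1, j)}))
            if j + 1 < cols:
                edges.add(frozenset({(i, j), (i, j + 1)}))
    return edges


def max_clique_size(nodes, edges):
    """Brute-force maximum clique size; only suitable for tiny graphs."""
    best = 1 if nodes else 0
    for k in range(2, len(nodes) + 1):
        found = any(
            all(frozenset(pair) in edges for pair in combinations(subset, 2))
            for subset in combinations(nodes, k)
        )
        if not found:
            # Every clique of size k+1 contains one of size k, so we can stop.
            break
        best = k
    return best


def samples_needed(eps, k):
    """Solve n**(-1/k) = eps for n, ignoring constants and log factors."""
    return eps ** (-k)


if __name__ == "__main__":
    rows, cols = 4, 4
    nodes = [(i, j) for i in range(rows) for j in range(cols)]
    d = len(nodes)                                     # ambient dimension
    r = max_clique_size(nodes, grid_edges(rows, cols))  # effective dimension
    eps = 0.1
    print(f"d = {d}, r = {r}")
    print(f"L2-loss network rate, k = 4 + r: n ~ {samples_needed(eps, 4 + r):.3g}")
    print(f"optimal L1 rate,      k = 2 + r: n ~ {samples_needed(eps, 2 + r):.3g}")
    print(f"standard rate,        k = 2 + d: n ~ {samples_needed(eps, 2 + d):.3g}")
```

Under these assumptions the grid has maximum clique size 2 regardless of how many pixels it contains, so the first two sample counts stay fixed as the image grows, while the count for the standard rate grows exponentially in $d$.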

Dimension-independent rates for structured neural density estimation

Breaking the curse of dimensionality in structured density estimation

A Neural Scaling Law from the Dimension of the Data Manifold

Analysis of the rate of convergence of fully connected deep neural network regression estimates with smooth activation function

Minimax density estimation for growing dimension

A Convergence Rate for Manifold Neural Networks

On the rate of convergence of a deep recurrent neural network estimate in a regression problem with dependent data

Nonparametric regression using deep neural networks with ReLU activation function

Effective Minkowski Dimension of Deep Nonparametric Regression: Function Approximation and Statistical Theories

Enhanced Expressive Power and Fast Training of Neural Networks by Random Projections

Speed Limits for Deep Learning

Deep Neural Networks for Estimation and Inference

Deep Neural Networks for Nonparametric Interaction Models with Diverging Dimension

Dimension-independent learning rates for high-dimensional classification problems

Sharp Rate of Convergence for Deep Neural Network Classifiers under the Teacher-Student Setting

Bottleneck Structure in Learned Features: Low-Dimension vs Regularity Tradeoff

Density Encoding Enables Resource-Efficient Randomly Connected Neural Networks

Near-optimal learning of Banach-valued, high-dimensional functions via deep neural networks

Can a Hebbian-like learning rule be avoiding the curse of dimensionality in sparse distributed data?

Deep Networks from the Principle of Rate Reduction

Can Shallow Neural Networks Beat the Curse of Dimensionality? A mean field training perspective