Abstract: There has been a resurgence of interest in the asymptotic normality of incomplete U-statistics that only sum over roughly as many kernel evaluations as there are data samples, due to its computational efficiency and usefulness in quantifying the uncertainty for ensemble-based predictions. In this paper, we focus on the normal convergence of one such construction, the incomplete U-statistic with Bernoulli sampling, based on a raw sample of size $n$ and a computational budget $N$. Under minimalistic moment assumptions on the kernel, we offer accompanying Berry-Esseen bounds of the natural rate $1/\sqrt{\min(N, n)}$ that characterize the normal approximating accuracy involved when $n \asymp N$, i.e. $n$ and $N$ are of the same order in such a way that $n/N$ is lower-and-upper bounded by constants. Our key techniques include Stein's method specialized for the so-called Studentized nonlinear statistics, and an exponential lower tail bound for non-negative kernel U-statistics.
What problem does this paper attempt to address?
The paper asks how accurate the normal approximation is for incomplete U-statistics computed under a limited computational budget. Specifically, it establishes a Berry-Esseen theorem for incomplete U-statistics constructed by Bernoulli sampling. The question matters in machine learning because incomplete U-statistics offer a computationally feasible way to quantify the uncertainty of ensemble predictions while aiming to retain statistical efficiency.
### Background and Motivation
Traditional (complete) U-statistics are computationally expensive. They sum the kernel over all size-\(m\) subsamples of the data, so the cost grows as \(O(n^m)\), where \(n\) is the sample size and \(m\) is the order of the kernel function; this quickly becomes prohibitive for large samples. In contrast, incomplete U-statistics sum over only some of the subsamples, which greatly reduces the computational burden. Two questions follow: does this reduction preserve statistical efficiency, and how accurate is the resulting normal approximation?
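The contrast between the complete and the Bernoulli-sampled statistic can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names, the choice of inclusion probability `p = budget / C(n, m)`, and the normalization by the realized number of selected subsets are all assumptions made here for the example. Note also that the sketch still enumerates every subset and only saves kernel evaluations; a practical implementation would draw the selected subsets directly.

```python
import itertools
import math
import random


def complete_u_statistic(data, kernel, m):
    """Average the kernel over all size-m subsets: O(n^m) kernel evaluations."""
    values = [kernel(*subset) for subset in itertools.combinations(data, m)]
    return sum(values) / len(values)


def incomplete_u_statistic(data, kernel, m, budget, rng=None):
    """Bernoulli-sampled incomplete U-statistic (illustrative sketch).

    Each size-m subset is kept independently with probability
    p = budget / C(n, m) (an assumed parametrization), so `budget` is the
    expected number of kernel evaluations. The sum is normalized by the
    realized number of kept subsets, which is also an assumption here.
    Returns the estimate and the realized number of evaluations.
    """
    rng = rng or random.Random()
    n = len(data)
    p = min(1.0, budget / math.comb(n, m))
    total, count = 0.0, 0
    for subset in itertools.combinations(data, m):
        if rng.random() < p:  # independent Bernoulli(p) selector
            total += kernel(*subset)
            count += 1
    if count == 0:
        raise ValueError("no subsets were selected; increase the budget")
    return total / count, count


def variance_kernel(x, y):
    """Order-2 kernel whose complete U-statistic is the unbiased sample variance."""
    return 0.5 * (x - y) ** 2


if __name__ == "__main__":
    rng = random.Random(0)
    sample = [rng.gauss(0.0, 1.0) for _ in range(200)]
    full = complete_u_statistic(sample, variance_kernel, 2)
    approx, evaluations = incomplete_u_statistic(
        sample, variance_kernel, 2, budget=200, rng=rng
    )
    print(f"complete: {full:.4f}")
    print(f"incomplete: {approx:.4f} using {evaluations} kernel evaluations")
```

Setting the budget equal to the sample size mirrors the regime the abstract describes, where roughly as many kernel evaluations are used as there are data points.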
### Main Contributions of the Paper
1. **Berry-Esseen Theorem**: The paper establishes a Berry-Esseen theorem for incomplete U-statistics under Bernoulli sampling, bounding the accuracy of the normal approximation. Specifically, under minimal moment assumptions on the kernel, the approximation error decays at the rate \(\frac{1}{\sqrt{\min(N, n)}}\) when \(n\) and \(N\) are of the same order, where \(N\) is the computational budget, that is, the expected number of kernel function evaluations.
2. **Technical Methods**: The Berry-Esseen bounds are derived with two tools: Stein's method specialized for Studentized nonlinear statistics, and an exponential lower tail bound for U-statistics with non-negative kernels.
3. **Applications and Significance**: The normal approximation accuracy of incomplete U-statistics is directly relevant to ensemble methods in machine learning. For example, Mentch and Hooker (2016) pointed out that incomplete U-statistics can be used to quantify the prediction uncertainty of ensemble models such as random forests. The results therefore contribute to the theory of U-statistics and also bear on practical applications.
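To see why the rate \(1/\sqrt{\min(N, n)}\) behaves like \(1/\sqrt{n}\) in the regime \(n \asymp N\), a short worked inequality may help. The constants \(c_1, c_2 > 0\) are names introduced here for the lower and upper bounds on \(n/N\) mentioned in the abstract; they do not appear in the summary above.

\[
c_1 \le \frac{n}{N} \le c_2
\;\Longrightarrow\;
\min(N, n) \ge \min\!\left(1, \tfrac{1}{c_2}\right) n
\;\Longrightarrow\;
\frac{1}{\sqrt{\min(N, n)}} \le \frac{\max\!\left(1, \sqrt{c_2}\right)}{\sqrt{n}} .
\]

By the symmetric argument, \(n \ge c_1 N\) gives \(1/\sqrt{\min(N, n)} \le \max\!\left(1, c_1^{-1/2}\right)/\sqrt{N}\), so the rate can be stated in terms of either \(n\) or \(N\) up to a constant factor.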
### Main Results
The main results of the paper can be summarized by the following theorem:
**Theorem 2.1** (Berry-Esseen Theorem): Let \(U'_{n,N}\) be an incomplete U-statistic constructed by Bernoulli sampling. Suppose the kernel function \(h\) is symmetric on \(M^m\) and satisfies conditions (1.2) and (1.3) of the paper, \(E[|h|^3]<\infty\), and \(2\leq m < n/2\). Then there exists an absolute constant \(C > 0\) such that:
\[
\sup_{z\in\mathbb{R}}\left|P\left(\frac{\sqrt{n}U'_{n,N}}{\sigma}\leq z\right)-\Phi(z)\right|\leq C\left\{\left(\frac{m}{n}\left(\frac{\sigma_h^2}{m\sigma_g^2}-1\right)\right)^{1/2}+\frac{E[|g|^3]}{\sqrt{n}\sigma_g^3}+\frac{E[|h|^3](1 - 2p + 2p^2)}{\sigma_h^3\sqrt{N(1 - p)}}+\exp\left(-\frac{[n/m]\sigma_h^6}{24(E[|h|^3])^2}\right)+R_{n,m,N,h}\right\}
\]