Abstract: In this paper we consider testing the equality of probability vectors of two independent multinomial distributions in high dimension. The classical chi-square test may have some drawbacks in this case, since many of the cell counts may be zero or may not be large enough. We propose a new test and show its asymptotic normality and its asymptotic power function. Based on the asymptotic power function, we present an application of our result to the neighborhood type test, which has been previously studied, especially for the case of fairly small $p$-values. To compare the proposed test with existing tests, we provide numerical studies including simulations and real data examples.
What problem does this paper attempt to address?
This paper addresses how to test whether the probability vectors of two independent multinomial distributions are equal when the data are high-dimensional and sparse. When the number of categories is large, the traditional chi-square test may perform poorly because many cell counts are zero or not large enough. The paper therefore proposes a new test and establishes its asymptotic normality and asymptotic power function.
### Specific problem description
With high-dimensional sparse data, the traditional chi-square test (Pearson's chi-square test) may not be applicable because the counts in many cells may be zero or very small, which undermines test methods that rely on large-sample theory. The specific problems mentioned in the paper include:
1. **High-dimensional sparse data**: When the number of categories \( k \) is very large and the counts in most categories are very small (such as 0, 1 or 2), the traditional chi-square test may no longer be applicable.
2. **Limitations of existing methods**: For example, Pearson's chi-square test requires the count in each cell to be large enough, a condition that sparse data often fail to meet.
3. **Differences from multivariate mean vector tests**: Although there are many studies on mean vector tests under multivariate normal distributions or factor models, the assumptions of these studies do not hold for multinomial distributions.
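To give a sense of how sparse the cells in item 1 can become, the following minimal sketch computes the expected number of empty cells using the standard fact that a cell with probability \( p_i \) is empty after \( n \) multinomial draws with probability \( (1-p_i)^n \). The function name `expected_empty_cells` and the values of `k` and `n` are hypothetical choices for illustration and do not come from the paper.

```python
def expected_empty_cells(probs, n):
    """Expected number of zero counts among the cells of a multinomial sample.

    Cell i stays empty after n draws with probability (1 - p_i)^n, so the
    expected number of empty cells is the sum of these probabilities.
    """
    return sum((1.0 - p) ** n for p in probs)


# Hypothetical setting: many categories, comparatively few observations.
k, n = 1000, 200
uniform = [1.0 / k] * k
empty_fraction = expected_empty_cells(uniform, n) / k
print(f"expected fraction of empty cells: {empty_fraction:.3f}")
```

In this illustrative setting roughly 82% of the cells are expected to be empty, which is the regime in which large-sample approximations for cell counts become questionable.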
### Goals of the paper
The main goal of the paper is to propose a new test statistic for testing whether the probability vectors of two independent multinomial distributions are equal when the data are high-dimensional and sparse. Specifically, the paper sets out to:
- Propose a test method suitable for high-dimensional sparse data.
- Show the asymptotic normality and asymptotic power function of this method.
- Compare the performance of the new method with that of existing methods through numerical simulations and actual data analysis.
### Mathematical formula representation
To describe the problem more precisely, the following mathematical symbols and formulas are used in the paper:
- Let \( N_c=(N_{c1},\ldots,N_{ck}) \), \( c = 1,2 \), be two independent count vectors with \( N_c \sim \text{Multinomial}(n_c,P_c,k) \), where \( P_c=(p_{c1},p_{c2},\ldots,p_{ck}) \) is the probability vector.
- The hypotheses to be tested are:
\[
H_0: P_1 = P_2\quad\text{vs.}\quad H_1: P_1\neq P_2
\]
- The newly proposed test statistic is based on an unbiased estimate of the Euclidean distance between \( P_1 \) and \( P_2 \), written here in terms of the cell counts \( N_{ci} \):
\[
D=\sum_{i = 1}^k\left[\left(\frac{N_{1i}}{n_1}-\frac{N_{2i}}{n_2}\right)^2-\frac{N_{1i}}{n_1^2}-\frac{N_{2i}}{n_2^2}\right]
\]
- The asymptotic normality result, in standardized form, is:
\[
\frac{\sum_{i = 1}^k f^*(N_{1i},N_{2i})-\|\xi\|_2^2}{\sigma_k}\xrightarrow{d}N(0,1)
\]
\]
where \( \sigma_k^2 = 2\sum_{i = 1}^k\left(\frac{p_{1i}}{n_1}+\frac{p_{2i}}{n_2}\right)^2 \).
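The quantities in the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the numerator of the standardized form is assumed to be the statistic \( D \), the centering value passed to `standardize` is a placeholder for \( \|\xi\|_2^2 \), and \( \sigma_k \) is computed from known cell probabilities, which are available only in a simulation.

```python
import math


def distance_statistic(counts1, counts2):
    """Sum over cells of (N1i/n1 - N2i/n2)^2 - N1i/n1^2 - N2i/n2^2."""
    if len(counts1) != len(counts2):
        raise ValueError("both samples must use the same k categories")
    n1, n2 = sum(counts1), sum(counts2)
    return sum(
        (x1 / n1 - x2 / n2) ** 2 - x1 / n1**2 - x2 / n2**2
        for x1, x2 in zip(counts1, counts2)
    )


def sigma_k(p1, p2, n1, n2):
    """sigma_k = sqrt(2 * sum_i (p1i/n1 + p2i/n2)^2).

    Assumption: the cell probabilities are known, as in a simulation study.
    """
    return math.sqrt(2.0 * sum((a / n1 + b / n2) ** 2 for a, b in zip(p1, p2)))


def standardize(statistic, centering, sigma):
    """(statistic - centering) / sigma; `centering` stands in for ||xi||_2^2."""
    return (statistic - centering) / sigma


# Toy data, made up for illustration: k = 5 cells, n1 = n2 = 4 observations.
counts1 = [2, 0, 1, 0, 1]
counts2 = [0, 3, 0, 1, 0]
d = distance_statistic(counts1, counts2)

# Hypothetical uniform cell probabilities, used only to evaluate sigma_k.
sigma = sigma_k([0.2] * 5, [0.2] * 5, n1=4, n2=4)

# Assumption: a centering of 0.0 is used purely as an illustrative value.
z = standardize(d, centering=0.0, sigma=sigma)
print(d, sigma, z)
```

Because the counts in each sample sum to \( n_c \), the two correction terms add up to \( 1/n_1 + 1/n_2 \), so the statistic equals the sum of squared differences of the observed proportions minus \( 1/n_1 + 1/n_2 \). It can therefore be negative: two identical samples of size 4 give \( -0.5 \).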
With these formulas and assumptions, the paper aims to provide a more effective test for comparing multinomial distributions when the data are high-dimensional and sparse.