Abstract: In this paper we consider testing the equality of probability vectors of two independent multinomial distributions in high dimension. The classical chi-square test may have some drawbacks in this case since many of the cell counts may be zero or may not be large enough. We propose a new test and show its asymptotic normality and its asymptotic power function. Based on the asymptotic power function, we present an application of our result to the neighborhood-type test, which has been previously studied, especially for the case of fairly small $p$-values. To compare the proposed test with existing tests, we provide numerical studies including simulations and real data examples.

What problem does this paper attempt to address?

The paper addresses how to test whether the probability vectors of two independent multinomial distributions are equal when the data are high-dimensional and sparse. Specifically, when the number of categories is large, the classical chi-square test may perform poorly because many cell counts are zero or not large enough. The authors therefore propose a new test and show its asymptotic normality and asymptotic power function.

### Specific problem description

With high-dimensional sparse data, Pearson's chi-square test may not be applicable because the counts in many cells are zero or very small, so test methods based on large-sample theory become ineffective. The specific problems mentioned in the paper include:

1. **High-dimensional sparse data**: When the number of categories $k$ is very large and the counts in most categories are very small (such as 0, 1 or 2), the traditional chi-square test may no longer be applicable.
2. **Limitations of existing methods**: For example, Pearson's chi-square test requires the count in each cell to be large enough, a condition that sparse data often fail to meet.
3. **Differences from multivariate mean vector tests**: Although there are many studies on mean vector tests under multivariate normal distributions or factor models, the assumptions of these studies do not apply to multinomial distributions.

### Goals of the paper

The main goal of the paper is to propose a new test statistic for testing whether the probability vectors of two independent multinomial distributions are equal in the case of high-dimensional sparse data. Specifically, the paper aims to:

- Propose a test method suitable for high-dimensional sparse data.
- Show the asymptotic normality and asymptotic power function of this method.
- Compare the performance of the new method with that of existing methods through numerical simulations and real data analysis.

### Mathematical formula representation

To describe the problem more precisely, the paper uses the following notation and formulas:

- Suppose there are two independent count vectors $N_c = (N_{c1},\ldots,N_{ck})$, $c = 1,2$, where $N_c$ follows $\text{Multinomial}(n_c,P_c,k)$ and $P_c = (p_{c1},p_{c2},\ldots,p_{ck})$ is the probability vector.
- The test hypothesis is:
  \[ H_0: P_1 = P_2\quad\text{vs.}\quad H_1: P_1\neq P_2 \]
- The newly proposed test statistic is based on an unbiased estimate of the squared Euclidean distance between $P_1$ and $P_2$:
  \[ D=\sum_{i = 1}^k\left[\left(\frac{N_{1i}}{n_1}-\frac{N_{2i}}{n_2}\right)^2-\frac{N_{1i}}{n_1^2}-\frac{N_{2i}}{n_2^2}\right] \]
- The standardized form of asymptotic normality is:
  \[ \frac{\sum_{i = 1}^k f^*(N_{1i},N_{2i})-\|\xi\|_2^2}{\sigma_k}\xrightarrow{d}N(0,1) \]
  where $\sigma_k^2 = 2\sum_{i = 1}^k\left(\frac{p_{1i}}{n_1}+\frac{p_{2i}}{n_2}\right)^2$. Here $\sum_{i=1}^k f^*(N_{1i},N_{2i})$ is read as the statistic $D$ above and $\xi$ as the difference $P_1 - P_2$.

Through these formulas and assumptions, the paper aims to provide a more effective method for testing multinomial distributions under high-dimensional sparse data.
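As a rough illustration of the formulas above, the following Python sketch computes the distance statistic $D$ from two count vectors and standardizes it. The function names `distance_statistic` and `plugin_sigma` are hypothetical, and the variance estimate is a naive plug-in that replaces each $p_{ci}$ in $\sigma_k^2$ with $N_{ci}/n_c$. This is an assumption made for illustration and is not necessarily the estimator used in the paper.

```python
import numpy as np


def distance_statistic(counts1, counts2):
    """D = sum_i [(N1i/n1 - N2i/n2)^2 - N1i/n1^2 - N2i/n2^2]."""
    c1 = np.asarray(counts1, dtype=float)
    c2 = np.asarray(counts2, dtype=float)
    n1, n2 = c1.sum(), c2.sum()
    terms = (c1 / n1 - c2 / n2) ** 2 - c1 / n1**2 - c2 / n2**2
    return float(terms.sum())


def plugin_sigma(counts1, counts2):
    """Naive plug-in for sigma_k (assumption: p_ci is replaced by N_ci / n_c)."""
    c1 = np.asarray(counts1, dtype=float)
    c2 = np.asarray(counts2, dtype=float)
    n1, n2 = c1.sum(), c2.sum()
    # p_ci / n_c becomes N_ci / n_c^2 after the plug-in substitution.
    return float(np.sqrt(2.0 * np.sum((c1 / n1**2 + c2 / n2**2) ** 2)))


# Hypothetical sparse setting: many categories, few observations per cell.
rng = np.random.default_rng(0)
k, n1, n2 = 2000, 500, 500
p = np.full(k, 1.0 / k)          # same probability vector, so H0 holds
x1 = rng.multinomial(n1, p)
x2 = rng.multinomial(n2, p)

d = distance_statistic(x1, x2)
z = d / plugin_sigma(x1, x2)     # compared with N(0, 1) quantiles under H0
print(f"D = {d:.6g}, standardized statistic = {z:.4f}")
```

Under $H_0$ the centering term $\|\xi\|_2^2$ is zero, so the standardized value would be compared against standard normal quantiles, with large positive values counting as evidence against $H_0$.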

Two-Sample Test for Sparse High Dimensional Multinomial Distributions
