Abstract: We study the problem of two-sample comparison with categorical data when the contingency table is sparsely populated. In modern applications, the number of categories is often comparable to the sample size, causing existing methods to have low power. When the number of categories is large, there is often underlying structure on the sample space that can be exploited. We propose a general non-parametric approach that utilizes similarity information on the space of all categories in two-sample tests. Our approach extends the graph-based tests of Friedman and Rafsky (1979) and Rosenbaum (2005), which are tests based on graphs connecting observations by similarity. Both tests require uniqueness of the underlying graph and cannot be directly applied to categorical data. We explored different ways to extend graph-based tests to the categorical setting and found two types of statistics that are both powerful and fast to compute. We showed that their permutation null distributions are asymptotically normal and that their $p$-value approximations under typical settings are quite accurate, facilitating the application of the new approach. The approach is illustrated through several examples.

What problem does this paper attempt to address?

This paper addresses the low power of two-sample tests on categorical data when the contingency table is sparsely populated, that is, when the number of categories is comparable to or larger than the sample size. Traditional methods such as the chi-square test perform well when each category is observed many times. In modern applications, however, the number of possible categories is often close to or even greater than the sample size, so many cells of the contingency table are empty or nearly empty, and the power of existing tests drops.

To overcome this, the authors propose a non-parametric method that uses similarity information on the category space to construct a graph, which is then used for two-sample testing. The method extends the graph-based tests of Friedman and Rafsky (1979) and Rosenbaum (2005), which test whether two samples come from the same distribution using a graph that connects observations by similarity. Those tests require the underlying graph to be unique. Categorical data produce ties in the similarity measure, so the graph is generally not unique and the tests cannot be applied directly.

The paper therefore explores different ways of extending graph-based tests to categorical data and identifies two types of statistics that are both powerful and fast to compute. It also shows that the permutation null distributions of these statistics are asymptotically normal and that the resulting p-value approximations are quite accurate in typical settings, which makes the method practical to apply. The method is illustrated through several examples, in particular on high-dimensional, sparse contingency tables, where it is reported to have higher power than the traditional chi-square test.
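The general idea of a graph-based two-sample test with a permutation null can be sketched as follows. This is a minimal illustration and not the authors' construction: it assumes categories are encoded as fixed-length tuples, uses Hamming distance as the similarity measure, and connects every pair of observations within a fixed radius. That graph is chosen here only because it is uniquely defined even when distances are tied. The function names (`threshold_graph`, `edge_count_test`) and the `radius` parameter are hypothetical.

```python
import random
from itertools import combinations


def hamming(a, b):
    """Number of coordinates in which two category tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def threshold_graph(points, radius=1):
    """Assumed graph: connect every pair of observations within `radius`.

    Unlike a minimum spanning tree, this graph is unique under ties.
    """
    return [(i, j) for i, j in combinations(range(len(points)), 2)
            if hamming(points[i], points[j]) <= radius]


def between_sample_edges(edges, labels):
    """Count edges whose endpoints carry different sample labels."""
    return sum(labels[i] != labels[j] for i, j in edges)


def edge_count_test(sample_a, sample_b, radius=1, n_perm=2000, seed=0):
    """Permutation test: few between-sample edges suggest different distributions."""
    points = list(sample_a) + list(sample_b)
    labels = [0] * len(sample_a) + [1] * len(sample_b)
    edges = threshold_graph(points, radius)
    observed = between_sample_edges(edges, labels)

    # Permutation null: shuffle sample labels, keep the graph fixed.
    rng = random.Random(seed)
    shuffled = labels[:]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if between_sample_edges(edges, shuffled) <= observed:
            hits += 1
    p_value = (1 + hits) / (1 + n_perm)

    # Under permutation, each edge joins different samples with
    # probability 2nm / (N(N-1)), which gives the null mean of the count.
    n, m, total = len(sample_a), len(sample_b), len(points)
    expected = 2 * n * m * len(edges) / (total * (total - 1))
    return observed, expected, p_value


sample_a = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 0, 0)]
sample_b = [(1, 1, 1), (1, 1, 0), (1, 0, 1), (1, 1, 1)]
observed, expected, p_value = edge_count_test(sample_a, sample_b)
print(observed, round(expected, 3), round(p_value, 4))
```

An observed between-sample edge count well below its permutation mean indicates that observations cluster by sample. The paper's normal approximation would replace the permutation loop with a closed-form p-value; that step is not reproduced here.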

Graph-Based Tests for Two-Sample Comparisons of Categorical Data

Tests for categorical data beyond Pearson: A distance covariance and energy distance approach

Two-sample testing for random graphs

On high-dimensional modifications of some graph-based two-sample tests

Nonparametric High-Dimensional Multi-Sample Tests based on Graph Theory

Graph-Based Tests for Multivariate Covariate Balance Under Multi-Valued Treatments

Two-Sample Test for Sparse High Dimensional Multinomial Distributions

On the properties of distance covariance for categorical data: Robustness, sure screening, and approximate null distributions

The AUGUST Two-Sample Test: Powerful, Interpretable, and Fast

The Classification Permutation Test: A Nonparametric Test for Equality of Multivariate Distributions

Graphical n-sample tests of correspondence of distributions

Testing Consistency of Two Histograms

Equivalence Test in Multi-dimensional Space with Applications in A/B Testing

A Kernel Method for the Two-Sample Problem

Optimal exact tests for composite alternative hypotheses on cross tabulated data

A class of nonparametric tests for the two-sample problem based on order statistics

Two-Sample Test Based on Classification Probability

Bayesian Optimal Two-sample Tests in High-dimension

Weighted Graph-Based Two-Sample Test via Empirical Likelihood

A Semiparametric Two-Sample Hypothesis Testing Problem for Random Graphs