Hao Chen, Nancy R. Zhang
Abstract: We study the problem of two-sample comparison with categorical data when the contingency table is sparsely populated. In modern applications, the number of categories is often comparable to the sample size, causing existing methods to have low power. When the number of categories is large, there is often underlying structure on the sample space that can be exploited. We propose a general non-parametric approach that utilizes similarity information on the space of all categories in two-sample tests. Our approach extends the graph-based tests of Friedman and Rafsky (1979) and Rosenbaum (2005), which are tests based on graphs connecting observations by similarity. Both tests require uniqueness of the underlying graph and cannot be directly applied to categorical data. We explored different ways to extend graph-based tests to the categorical setting and found two types of statistics that are both powerful and fast to compute. We showed that their permutation null distributions are asymptotically normal and that their $p$-value approximations under typical settings are quite accurate, facilitating the application of the new approach. The approach is illustrated through several examples.
What problem does this paper attempt to address?
This paper addresses the low power of two-sample comparisons on categorical data when the contingency table is sparsely populated, that is, when the number of categories is comparable to or larger than the sample size. Traditional methods such as the chi-square test perform well when each category is observed many times. In modern applications, however, the number of possible categories is often close to or even greater than the sample size, so many cells of the contingency table are observed rarely or not at all, and the power of existing tests drops.
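The sparsity problem can be made concrete with a small, purely illustrative calculation. The sketch below computes the standard Pearson chi-square statistic for a 2 x K table in plain Python; the function name `pearson_chi_square` and the two toy tables are assumptions made for this example and do not come from the paper. When every category is observed exactly once, the statistic equals the total sample size no matter how the observations are split between the two samples, so it cannot distinguish a clear separation from a thorough mixing.

```python
def pearson_chi_square(counts_a, counts_b):
    """Pearson chi-square statistic for a 2 x K table of category counts.

    Returns the statistic and the list of expected cell counts.
    Categories that were never observed are skipped.
    """
    n_a, n_b = sum(counts_a), sum(counts_b)
    n = n_a + n_b
    stat = 0.0
    expected = []
    for a, b in zip(counts_a, counts_b):
        col = a + b
        if col == 0:
            continue
        e_a = n_a * col / n  # expected count in sample A under the null
        e_b = n_b * col / n  # expected count in sample B under the null
        stat += (a - e_a) ** 2 / e_a + (b - e_b) ** 2 / e_b
        expected.extend([e_a, e_b])
    return stat, expected


# Hypothetical sparse tables: 10 categories, 5 observations per sample,
# every category observed exactly once.
separated_a = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
separated_b = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
mixed_a = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
mixed_b = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

stat_separated, expected = pearson_chi_square(separated_a, separated_b)
stat_mixed, _ = pearson_chi_square(mixed_a, mixed_b)

print(stat_separated, stat_mixed)  # both equal the total sample size, 10
print(max(expected))               # every expected cell count is 0.5
```

If the categories were ordered by similarity, the first table would suggest two well-separated samples and the second two interleaved ones, yet the statistic is identical. This is the kind of information a similarity-based test can use and a cell-count test cannot.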
To overcome this challenge, the authors propose a new non-parametric method that uses similarity information on the space of categories to construct a graph, which is then used for two-sample testing. The method extends the graph-based tests of Friedman and Rafsky (1979) and Rosenbaum (2005), which test whether two samples come from the same distribution using a graph that connects observations by similarity. These tests require the underlying graph to be unique and therefore cannot be applied directly to categorical data. The paper explores different ways to extend graph-based tests to categorical data and proposes two types of statistics that are both powerful and fast to compute. It also shows that the permutation null distributions of these statistics are asymptotically normal and that the resulting p-value approximations are quite accurate in typical settings, which facilitates the application of the new method.
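For readers unfamiliar with graph-based tests, the following is a minimal sketch of the general idea in the original setting where the graph is unique: build a minimum spanning tree on the pooled observations, count the edges that join observations from different samples, and compare that count with its permutation distribution. It is written in the spirit of the Friedman and Rafsky (1979) edge-count test and is not the categorical extension proposed in the paper. All function names, the toy data, and the choice of 2000 permutations are assumptions made for illustration.

```python
import random


def minimum_spanning_tree(dist):
    """Prim's algorithm on a full distance matrix; returns a list of (i, j) edges."""
    n = len(dist)
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known connection cost to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the cheapest node not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges


def between_sample_edges(edges, labels):
    """Number of edges whose endpoints carry different sample labels."""
    return sum(1 for i, j in edges if labels[i] != labels[j])


def permutation_p_value(edges, labels, n_perm=2000, seed=0):
    """One-sided permutation p-value: few between-sample edges suggest a difference."""
    rng = random.Random(seed)
    observed = between_sample_edges(edges, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if between_sample_edges(edges, shuffled) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical one-dimensional data: two well-separated samples of size 5.
xs = [0.0, 0.11, 0.23, 0.36, 0.5, 3.0, 3.12, 3.25, 3.39, 3.54]
labels = [0] * 5 + [1] * 5
dist = [[abs(a - b) for b in xs] for a in xs]

edges = minimum_spanning_tree(dist)
observed, p_value = permutation_p_value(edges, labels)
print(observed, p_value)
```

With categorical data, many observations share the same category and many pairwise distances are tied, so the minimum spanning tree is generally not unique and the edge count depends on an arbitrary choice among equally good trees. That is the obstacle the uniqueness requirement above refers to.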
Through several examples, the paper demonstrates the method's effectiveness, particularly on high-dimensional, sparse contingency tables, where it is reported to have substantially higher power than the traditional chi-square test.