Abstract: Non-parametric two-sample tests based on energy distance or maximum mean discrepancy are widely used statistical tests for comparing multivariate data from two populations. While these tests enjoy desirable statistical properties, their test statistics can be expensive to compute, as they require the computation of 3 distinct Euclidean distance (or kernel) matrices between samples, where the time complexity of each of these computations (namely, $O(n_{x}^2 p)$, $O(n_{y}^2 p)$, and $O(n_{x} n_{y} p)$) scales quadratically with the number of samples ($n_x$, $n_y$) and linearly with the number of variables ($p$). Since the standard permutation test requires repeated re-computation of these expensive statistics, its application to large datasets can become infeasible. While several statistical approaches have been proposed to mitigate this issue, they all sacrifice desirable statistical properties to decrease the computational cost (e.g., they trade statistical power for computational speed). A better computational strategy is to first pre-compute the Euclidean distance (kernel) matrix of the concatenated data, and then permute indexes and retrieve the corresponding elements to compute the re-sampled statistics. While this strategy can reduce the computational cost relative to the standard permutation test, it relies on the computation of a larger Euclidean distance (kernel) matrix with complexity $O((n_x + n_y)^2 p)$. In this paper, we present a novel computationally efficient permutation algorithm which only requires the pre-computation of the 3 smaller matrices and achieves large computational speedups without sacrificing finite-sample validity or statistical power. We illustrate its computational gains in a series of experiments and compare its statistical power to the current state-of-the-art approach for balancing computational cost and statistical performance.
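To make the pooled-matrix strategy mentioned in the abstract concrete, the following is a minimal NumPy sketch under stated assumptions. It is not the paper's novel algorithm, which the abstract does not spell out. It only illustrates the baseline idea of computing the distance matrix of the concatenated data once and then reading permuted blocks out of it. The function names (`pairwise_dist`, `energy_from_pooled`, `permutation_test`), the V-statistic form of the energy distance, and the `(1 + count) / (1 + n_perm)` p-value convention are illustrative choices, not taken from the source.

```python
import numpy as np


def pairwise_dist(a, b):
    """Euclidean distances between rows of a (n x p) and rows of b (m x p).

    Uses broadcasting for clarity; this needs O(n * m * p) memory, so it is
    only meant for small illustrative inputs.
    """
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)


def energy_from_pooled(D, ix, iy):
    """Energy distance (V-statistic form, an assumption of this sketch)
    read from the pooled distance matrix D using index sets ix and iy."""
    dxy = D[np.ix_(ix, iy)].mean()  # between-sample block
    dxx = D[np.ix_(ix, ix)].mean()  # within-sample block for the first group
    dyy = D[np.ix_(iy, iy)].mean()  # within-sample block for the second group
    return 2.0 * dxy - dxx - dyy


def permutation_test(x, y, n_perm=200, seed=0):
    """Permutation test that computes the pooled distance matrix only once."""
    rng = np.random.default_rng(seed)
    nx = x.shape[0]
    z = np.vstack([x, y])        # concatenated data, (nx + ny) x p
    D = pairwise_dist(z, z)      # O((nx + ny)^2 p), computed a single time
    idx = np.arange(z.shape[0])
    observed = energy_from_pooled(D, idx[:nx], idx[nx:])

    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(idx)  # relabel observations, no new distances
        stat = energy_from_pooled(D, perm[:nx], perm[nx:])
        if stat >= observed:
            count += 1

    # Add-one correction so the p-value is never exactly zero.
    p_value = (1 + count) / (1 + n_perm)
    return observed, p_value
```

A call such as `permutation_test(x, y, n_perm=500)` with `x` and `y` given as `(n_x, p)` and `(n_y, p)` arrays returns the observed statistic and a permutation p-value. Each permutation only indexes into `D`, so the per-permutation cost no longer depends on $p$, at the price of storing the larger $(n_x + n_y) \times (n_x + n_y)$ matrix.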

Scaling property of the statistical Two-Sample Energy Test

Calculating $p$-values and their significances with the Energy Test for large datasets

A new test for the multivariate two-sample problem based on the concept of minimum energy

A new class of binning free, multivariate goodness-of-fit tests: the energy tests

Statistical Methods for Investigating the Cosmic Ray Energy Spectrum

On the distribution of the power function for the scale parameter of exponential families

Energy distance and kernel mean embedding for two sample survival test

Large-Scale Simultaneous Testing Using Kernel Density Estimation

A nonparametric two-sample conditional distribution test

A two-sample nonparametric test for one-sided location-scale alternative

Two samples test for discrete power-law distributions

Computationally efficient permutation tests for the multivariate two-sample problem based on energy distance or maximum mean discrepancy statistics

Testing Equality of Spectral Density Operators for Functional Processes

The AUGUST Two-Sample Test: Powerful, Interpretable, and Fast

Two-Sample Smooth Tests for the Equality of Distributions

A Kernel Method for the Two-Sample Problem

Refereeing the Referees: Evaluating Two-Sample Tests for Validating Generators in Precision Sciences

Convergence Criteria for Single-step Free-energy Calculations: The Relation between the Π Bias Measure and the Sample Variance

E-Valuating Classifier Two-Sample Tests

An Entropy-Based Approach for Nonparametrically Testing Simple Probability Distribution Hypotheses

Two-Sample Test for Sparse High Dimensional Multinomial Distributions