A Kernel Method for the Two-Sample Problem

Arthur Gretton, Karsten Borgwardt, Malte J. Rasch, Bernhard Schölkopf, Alexander J. Smola

DOI: https://doi.org/10.48550/arXiv.0805.2368

2008-05-16

Abstract: We propose a framework for analyzing and comparing distributions, allowing us to design statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS). We present two tests based on large deviation bounds for the test statistic, while a third is based on the asymptotic distribution of this statistic. The test statistic can be computed in quadratic time, although efficient linear time approximations are available. Several classical metrics on distributions are recovered when the function space used to compute the difference in expectations is allowed to be more general (e.g. a Banach space). We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests.

Machine Learning, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses the two-sample problem: given two samples, decide whether they were drawn from different probability distributions. The authors propose a kernel-based framework for analyzing and comparing distributions and use it to design statistical tests for this decision. The core quantity is the Maximum Mean Discrepancy (MMD), the largest difference between the expectations of a function under the two distributions, taken over all functions in the unit ball of a reproducing kernel Hilbert space (RKHS).

The paper proposes three nonparametric tests based on MMD:

1. **Tests based on large deviation bounds**: The first two tests use distribution-independent uniform convergence bounds. They provide performance guarantees at finite sample sizes, but may be conservative.
2. **Test based on the asymptotic distribution**: The third test is based on the asymptotic distribution of the empirical estimate of MMD and is more sensitive to differences between distributions at small sample sizes.

According to the summary, these tests have good theoretical properties and also perform well in practice, especially on high-dimensional data, in low-sample-size settings, and when comparing distributions over graphs. The paper also discusses a linear-time approximation of MMD, which allows more data to be processed for a given computational cost than the quadratic-time statistic.

Overall, the paper aims to provide a flexible tool for detecting differences between probability distributions across application areas, with bioinformatics and database attribute matching as examples.
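To make the quadratic-time versus linear-time distinction concrete, here is a minimal sketch of two empirical estimates of squared MMD. It is an illustration under stated assumptions, not the authors' implementation: the Gaussian kernel, the bandwidth `sigma`, and the function names `gaussian_kernel`, `mmd2_unbiased`, and `mmd2_linear` are all choices made for this example. The quadratic estimate averages kernel values over all pairs (leaving out the diagonal within each sample), while the linear one averages over disjoint consecutive pairs only.

```python
import numpy as np


def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix between rows of a and rows of b (assumed kernel choice)."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))


def mmd2_unbiased(x, y, sigma=1.0):
    """Quadratic-time estimate of squared MMD with diagonal terms excluded."""
    m, n = len(x), len(y)
    k_xx = gaussian_kernel(x, x, sigma)
    k_yy = gaussian_kernel(y, y, sigma)
    k_xy = gaussian_kernel(x, y, sigma)
    # Drop k(x_i, x_i) and k(y_j, y_j) so the within-sample averages are unbiased.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * k_xy.mean()


def mmd2_linear(x, y, sigma=1.0):
    """Linear-time estimate: average over disjoint consecutive pairs of points."""
    # Use an even number of points common to both samples.
    m = (min(len(x), len(y)) // 2) * 2
    x1, x2 = x[0:m:2], x[1:m:2]
    y1, y2 = y[0:m:2], y[1:m:2]

    def k(a, b):
        # Row-wise Gaussian kernel between paired points.
        return np.exp(-np.sum((a - b) ** 2, axis=1) / (2.0 * sigma**2))

    h = k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1)
    return h.mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 1.0, size=(500, 2))
    y = rng.normal(2.0, 1.0, size=(500, 2))  # shifted mean: a different distribution
    print("quadratic estimate:", mmd2_unbiased(x, y))
    print("linear estimate:   ", mmd2_linear(x, y))
```

Turning either estimate into a test still requires a rejection threshold, for example from the large deviation bounds or the asymptotic distribution described above; that step is not shown here. The linear estimate has higher variance per sample but touches each point only once, which is the trade-off the summary refers to.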

Generalized kernel two-sample tests

Exponentially Consistent Kernel Two-Sample Tests

A uniform kernel trick for high-dimensional two-sample problems

A Kernel-Based Conditional Two-Sample Test Using Nearest Neighbors (with Applications to Calibration, Regression Curves, and Simulation-Based Inference)

Spectral Regularized Kernel Two-Sample Tests

The AUGUST Two-Sample Test: Powerful, Interpretable, and Fast

A unified framework for multivariate two-sample and k-sample kernel-based quadratic distance goodness-of-fit tests

B-tests: Low Variance Kernel Two-Sample Tests

Nyström Kernel Stein Discrepancy

A distance based two-sample test of means difference for multivariate datasets

Testing distributional equality for functional random variables

A High-dimensional Convergence Theorem for U-statistics with Applications to Kernel-based Testing

Boosting the Power of Kernel Two-Sample Tests

Learning Deep Kernels for Non-Parametric Two-Sample Tests

Nonparametric Two-Sample Testing by Betting

Two-Sample Smooth Tests for the Equality of Distributions

Hypothesis testing using pairwise distances and associated kernels (with Appendix)

Kernel Two-Sample Tests for Manifold Data

Large-Scale Simultaneous Testing Using Kernel Density Estimation

Computational-Statistical Trade-off in Kernel Two-Sample Testing with Random Fourier Features