A Kernel Method for the Two-Sample Problem

Arthur Gretton, Karsten Borgwardt, Malte J. Rasch, Bernhard Scholkopf, Alexander J. Smola
DOI: https://doi.org/10.48550/arXiv.0805.2368
2008-05-16
Abstract: We propose a framework for analyzing and comparing distributions, allowing us to design statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS). We present two tests based on large deviation bounds for the test statistic, while a third is based on the asymptotic distribution of this statistic. The test statistic can be computed in quadratic time, although efficient linear time approximations are available. Several classical metrics on distributions are recovered when the function space used to compute the difference in expectations is allowed to be more general (e.g. a Banach space). We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the two-sample problem: given samples from two probability distributions, decide whether the distributions differ. The authors propose a kernel-based framework for analyzing and comparing distributions and use it to design statistical tests. The core quantity is the Maximum Mean Discrepancy (MMD), the largest difference in expectations between the two distributions over functions in the unit ball of a reproducing kernel Hilbert space (RKHS).

The paper proposes three non-parametric tests based on MMD, which fall into two groups:

1. **Tests based on large deviation bounds**: The first two tests use distribution-independent uniform convergence bounds. They provide performance guarantees at finite sample sizes but can be conservative.
2. **Test based on the asymptotic distribution**: The third test uses the asymptotic distribution of the empirical MMD estimate and is, in practice, more sensitive to differences in distribution at small sample sizes.

Beyond their theoretical properties, the tests perform well in practical applications, especially on high-dimensional data, at low sample sizes, and when comparing distributions over graph data. The test statistic takes quadratic time to compute, and the paper also describes a linear-time approximation, so that more data can be handled at a given computational cost. Overall, the paper aims to provide a powerful and flexible tool for detecting differences between probability distributions in a range of applications, including bioinformatics and database attribute matching.
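To make the quadratic-time and linear-time computations concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions rather than the authors' implementation: the Gaussian kernel, the bandwidth `sigma`, and the function names `gaussian_kernel`, `mmd2_unbiased`, and `mmd2_linear` are choices made for this example. The quadratic-time function uses the commonly cited unbiased estimate of squared MMD for samples of sizes m and n, and the linear-time function averages a four-term kernel expression over disjoint consecutive pairs of points.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)).

    The kernel choice and the bandwidth sigma are assumptions of this sketch.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2_unbiased(xs, ys, kernel=gaussian_kernel):
    """Quadratic-time unbiased estimate of squared MMD between two samples.

    xs and ys are sequences of equal-length tuples; each needs >= 2 points.
    """
    m, n = len(xs), len(ys)
    # Within-sample sums skip the diagonal (i == j), which removes the bias.
    k_xx = sum(kernel(xs[i], xs[j]) for i in range(m) for j in range(m) if i != j)
    k_yy = sum(kernel(ys[i], ys[j]) for i in range(n) for j in range(n) if i != j)
    # The cross term uses every (x, y) pair.
    k_xy = sum(kernel(x, y) for x in xs for y in ys)
    return k_xx / (m * (m - 1)) + k_yy / (n * (n - 1)) - 2.0 * k_xy / (m * n)


def mmd2_linear(xs, ys, kernel=gaussian_kernel):
    """Linear-time estimate of squared MMD from disjoint consecutive pairs.

    Each point is used once, so the cost grows linearly with the sample size.
    Points beyond the shorter sample (or an odd leftover point) are ignored.
    """
    pairs = min(len(xs), len(ys)) // 2
    total = 0.0
    for i in range(pairs):
        x1, x2 = xs[2 * i], xs[2 * i + 1]
        y1, y2 = ys[2 * i], ys[2 * i + 1]
        total += kernel(x1, x2) + kernel(y1, y2) - kernel(x1, y2) - kernel(x2, y1)
    return total / pairs
```

Both functions take sequences of tuples, for example `mmd2_unbiased([(0.1,), (0.4,)], [(2.0,), (2.3,)])`. Because the estimates are unbiased, they can come out slightly negative when the two samples are drawn from the same distribution. Turning an estimate into a test still requires a threshold, which the paper obtains either from large deviation bounds or from the asymptotic distribution of the statistic; that step is not shown here.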