Large Deviation Analysis of Score-based Hypothesis Testing

Enmao Diao, Taposh Banerjee, Vahid Tarokh
2024-02-04
Abstract: Score-based statistical models play an important role in modern machine learning, statistics, and signal processing. For hypothesis testing, a score-based hypothesis test is proposed in \cite{wu2022score}. We analyze the performance of this score-based hypothesis testing procedure and derive upper bounds on the probabilities of its Type I and II errors. We prove that the exponents of our error bounds are asymptotically (in the number of samples) tight for the case of simple null and alternative hypotheses. We calculate these error exponents explicitly in specific cases and provide numerical studies for various other scenarios of interest.
Signal Processing, Methodology
What problem does this paper attempt to address?
The paper studies score-based hypothesis testing, motivated by the unnormalized models and score models that are common in modern machine learning. The traditional likelihood ratio test (LRT) is optimal when the data densities are known, but for many complex models the exact likelihood is difficult to compute. The paper first highlights the importance of score matching, especially in image generation tasks, where it outperforms likelihood-based methods. It then points out that the score (the gradient of the log-density) can be learned even when the density of the data distribution is unknown, whereas the exact likelihood of an unnormalized model cannot be computed directly. For this reason, the paper focuses on the score-based binary hypothesis test proposed in \cite{wu2022score}, which relies on the Hyvärinen score instead of the likelihood ratio.
The paper derives upper bounds on the Type I error (false positive) and Type II error (false negative) probabilities of this test for a finite number of samples. Using large deviation theory, it proves that the exponents of these bounds are asymptotically tight, as the number of samples goes to infinity, when both the null and the alternative hypotheses are simple.
The paper calculates closed-form expressions of the error exponents for multivariate Gaussian distributions and uses numerical simulation to estimate the exponents in other cases. The numerical experiments cover synthetic data (multivariate normal distributions, exponential families, and Gauss-Bernoulli restricted Boltzmann machines) as well as real-world data (the KDD Cup'99 network security dataset). The experimental results are consistent with the theoretical predictions: as the sample size increases, the empirical error exponent approaches the theoretical limit.
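To make the test concrete, here is a minimal Python sketch for the multivariate Gaussian case. It assumes the standard definition of the Hyvärinen score, S_H(x, P) = 0.5 * ||grad log p(x)||^2 + Laplacian of log p(x), and assumes the decision rule "decide for the alternative when the summed score difference exceeds a threshold". The function names, the threshold value, and the demo parameters are illustrative choices, not taken from the paper.

```python
import numpy as np

def hyvarinen_score_gaussian(x, mean, cov):
    """Hyvarinen score of each row of x under N(mean, cov).

    For a Gaussian, grad log p(x) = -inv(cov) @ (x - mean) and the
    Laplacian of log p(x) is -trace(inv(cov)); neither needs the
    normalizing constant of the density.
    """
    x = np.atleast_2d(np.asarray(x, dtype=float))
    precision = np.linalg.inv(cov)
    # precision is symmetric, so each row equals -precision @ (x_i - mean)
    grad = -(x - mean) @ precision
    laplacian = -np.trace(precision)
    return 0.5 * np.sum(grad**2, axis=1) + laplacian

def score_test(samples, mean0, cov0, mean1, cov1, threshold=0.0):
    """Return (decide_alternative, statistic).

    Assumed convention: a lower Hyvarinen score means a better fit, so a
    large value of sum(S_H(x, P0) - S_H(x, P1)) favors the alternative.
    """
    s0 = hyvarinen_score_gaussian(samples, mean0, cov0)
    s1 = hyvarinen_score_gaussian(samples, mean1, cov1)
    statistic = float(np.sum(s0 - s1))
    return bool(statistic > threshold), statistic

# Illustrative demo: data drawn from the null N(0, I) in two dimensions.
rng = np.random.default_rng(0)
mean0, mean1, eye = np.zeros(2), np.ones(2), np.eye(2)
null_samples = rng.multivariate_normal(mean0, eye, size=200)
decision, stat = score_test(null_samples, mean0, eye, mean1, eye)
print("decide alternative:", decision, "statistic:", stat)
```

With identity covariances the per-sample difference reduces to x·m - 0.5 * ||m||^2 for alternative mean m, so in this special case the statistic coincides with the log-likelihood ratio; with different covariances the two tests generally differ.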
In summary, the paper addresses hypothesis testing when the exact density of the data is unavailable: it analyzes the score-based test of \cite{wu2022score}, bounds its Type I and Type II error probabilities, and supports the theoretical analysis with empirical validation.