Abstract: We consider the problem of hypothesis testing for discrete distributions. In the standard model, where we have sample access to an underlying distribution $p$, extensive research has established optimal bounds for uniformity testing, identity testing (goodness of fit), and closeness testing (equivalence or two-sample testing). We explore these problems in a setting where a predicted data distribution, possibly derived from historical data or predictive machine learning models, is available. We demonstrate that such a predictor can indeed reduce the number of samples required for all three property testing tasks. The reduction in sample complexity depends directly on the predictor's quality, measured by its total variation distance from $p$. A key advantage of our algorithms is their adaptability to the precision of the prediction. Specifically, our algorithms can self-adjust their sample complexity based on the accuracy of the available prediction, operating without any prior knowledge of the estimation's accuracy (i.e., they are consistent). Additionally, we never use more samples than the standard approaches require, even if the predictions provide no meaningful information (i.e., they are also robust). We provide lower bounds to indicate that the improvements in sample complexity achieved by our algorithms are information-theoretically optimal. Furthermore, experimental results show that the performance of our algorithms on real data significantly exceeds our worst-case guarantees for sample complexity, demonstrating the practicality of our approach.

What problem does this paper attempt to address?

This paper asks how a predicted distribution can reduce the number of samples needed for hypothesis testing of discrete distributions. Specifically, it studies how to exploit such a prediction in uniformity testing, identity testing (also known as goodness-of-fit testing), and closeness testing (or two-sample testing). The prediction may be derived from historical data or from predictive machine-learning models.

### Main Contributions

1. **Reduction of Sample Complexity**: The paper shows that a predicted distribution can reduce the number of samples required for all three property-testing tasks. The reduction depends directly on the quality of the prediction, measured by the total variation distance (TVD) between the predicted distribution and the actual distribution.
2. **Algorithm Adaptability**: The proposed algorithms self-adjust their sample complexity to the accuracy of the prediction, without prior knowledge of that accuracy (consistency). Even if the prediction carries no meaningful information, the algorithms never use more samples than the standard methods (robustness).
3. **Theoretical Optimality**: The paper provides lower bounds indicating that the improvements in sample complexity achieved by the proposed algorithms are information-theoretically optimal.
4. **Experimental Verification**: Experiments show that the performance of the proposed algorithms on real data is significantly better than the worst-case sample-complexity guarantees, demonstrating the practicality of the approach.
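The prediction quality mentioned in the first contribution is the total variation distance between two discrete distributions. A minimal sketch of that quantity follows; the function name and the example vectors are illustrative assumptions, not taken from the paper.

```python
def total_variation_distance(p, p_hat):
    """Total variation distance between two discrete distributions.

    Both arguments are equal-length sequences of probabilities over the
    same domain; the distance is half the L1 distance between them.
    """
    if len(p) != len(p_hat):
        raise ValueError("distributions must share the same domain size")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, p_hat))


# Hypothetical example: a uniform distribution on 4 elements
# and a skewed prediction of it.
n = 4
uniform = [1.0 / n] * n
prediction = [0.4, 0.3, 0.2, 0.1]
print(total_variation_distance(uniform, prediction))  # approximately 0.2
```

A distance of 0 means the prediction is perfect, while a distance close to 1 means the prediction and the true distribution place their mass on nearly disjoint sets.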
### Technical Details

- **Prediction Quality Metric**: The accuracy of the prediction is measured by the total variation distance (TVD) between the predicted distribution $\hat{p}$ and the unknown distribution $p$.
- **Search and Test**: The algorithm has two parts. The search part guesses $\|p-\hat{p}\|_{\text{TV}}$, and the test part uses this guessed accuracy level, denoted $\alpha$ below, to run the actual distribution test.
- **Sample Complexity**:
    - For uniformity testing and identity testing, let $d = \|q-\hat{p}\|_{\text{TV}}$, where $q$ is the reference distribution being tested against. When $d\leq\alpha$, the required number of samples is $\Theta\left(\frac{\sqrt{n}}{\epsilon^{2}}\right)$; when $d > \alpha$, it is $\Theta\left(\min\left(\frac{1}{(d-\alpha)^{2}}, \frac{\sqrt{n}}{\epsilon^{2}}\right)\right)$.
    - For closeness testing, the required number of samples is $\Theta\left(\frac{n^{2/3}\alpha^{1/3}}{\epsilon^{4/3}}+\frac{\sqrt{n}}{\epsilon^{2}}\right)$.

### Experimental Results

- **Synthetic and Real Data**: The experiments show that the proposed algorithms significantly reduce sample complexity on both synthetic and real data, especially when the predicted distribution is close to the actual distribution. For example, on network traffic data, the sample complexity is reduced by up to 40%.

### Related Work

- **Perfect Predictor**: Early research assumed the existence of a perfect predictor (i.e., $\hat{p}=p$), but this assumption often does not hold in practice.
- **Approximate Predictor**: Some works assume that the predicted probability of each element is within a factor of $(1\pm\epsilon)$ of the true value, but this assumption is still too strict.
- **Support Estimation**: Other research focuses on the support estimation problem and allows each predicted probability to be within a constant factor of the true value, but these methods usually apply only to specific problems.

In conclusion, this paper...
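The sample-complexity rates listed under Technical Details can be compared numerically. The sketch below simply evaluates those expressions with every constant hidden by $\Theta(\cdot)$ dropped, so the outputs indicate trends only and are not actual sample counts. The function names and the example parameter values are assumptions made for illustration.

```python
import math


def standard_rate(n, eps):
    """Standard rate sqrt(n) / eps^2 (constants dropped)."""
    return math.sqrt(n) / eps ** 2


def augmented_identity_rate(n, eps, d, alpha):
    """Rate for uniformity/identity testing with a prediction.

    d is the TV distance between the reference distribution q and the
    prediction; alpha is the guessed accuracy level of the prediction.
    Constants are dropped, so this is a trend, not a sample count.
    """
    if d <= alpha:
        return standard_rate(n, eps)
    return min(1.0 / (d - alpha) ** 2, standard_rate(n, eps))


def augmented_closeness_rate(n, eps, alpha):
    """Rate for closeness testing with a prediction (constants dropped)."""
    return n ** (2 / 3) * alpha ** (1 / 3) / eps ** (4 / 3) + math.sqrt(n) / eps ** 2


# Hypothetical parameters: domain size 10,000 and proximity parameter 0.1.
n, eps = 10_000, 0.1
print(standard_rate(n, eps))
print(augmented_identity_rate(n, eps, d=0.5, alpha=0.1))
print(augmented_closeness_rate(n, eps, alpha=0.01))
```

With these example values, the identity-testing rate falls from roughly $10^4$ to roughly $6$ once $d$ exceeds $\alpha$ by $0.4$, which shows why a prediction that is far from the reference distribution is so useful: a few samples suffice to tell them apart.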

Optimal Algorithms for Augmented Testing of Discrete Distributions

The Asymptotic Distribution and Berry–Esseen Bound of a New Test for Independence in High Dimension with an Application to Stochastic Optimization

Equivalence Testing: The Power of Bounded Adaptivity

Test without Trust: Optimal Locally Private Distribution Testing

Distribution-free Multiple Testing

Optimal Private and Communication Constraint Distributed Goodness-of-Fit Testing for Discrete Distributions in the Large Sample Regime

Optimal Approximate Sampling from Discrete Probability Distributions

Optimal Multi-Distribution Learning

Sequential algorithms for testing identity and closeness of distributions

Efficient Discrepancy Testing for Learning with Distribution Shift

Testing with Non-identically Distributed Samples

Improving and extending the testing of distributions for shape-restricted properties

Distribution Testing with a Confused Collector

On the Optimal Error Exponent of Type-Based Distributed Hypothesis Testing

Simple Binary Hypothesis Testing under Local Differential Privacy and Communication Constraints

A View on Out-of-Distribution Identification from a Statistical Testing Theory Perspective

Improving Pearson's chi-squared test: hypothesis testing of distributions -- optimally

Testing Support Size More Efficiently Than Learning Histograms

Testing Poisson Binomial Distributions

Conditional Independence Testing for Discrete Distributions: Beyond $χ^2$- and $G$-tests

From Data to Decisions: Distributionally Robust Optimization Is Optimal