Optimal Algorithms for Augmented Testing of Discrete Distributions

Maryam Aliakbarpour, Piotr Indyk, Ronitt Rubinfeld, Sandeep Silwal
2024-12-02
Abstract: We consider the problem of hypothesis testing for discrete distributions. In the standard model, where we have sample access to an underlying distribution $p$, extensive research has established optimal bounds for uniformity testing, identity testing (goodness of fit), and closeness testing (equivalence or two-sample testing). We explore these problems in a setting where a predicted data distribution, possibly derived from historical data or predictive machine learning models, is available. We demonstrate that such a predictor can indeed reduce the number of samples required for all three property testing tasks. The reduction in sample complexity depends directly on the predictor's quality, measured by its total variation distance from $p$. A key advantage of our algorithms is their adaptability to the precision of the prediction. Specifically, our algorithms can self-adjust their sample complexity based on the accuracy of the available prediction, operating without any prior knowledge of the estimation's accuracy (i.e., they are consistent). Additionally, we never use more samples than the standard approaches require, even if the predictions provide no meaningful information (i.e., they are also robust). We provide lower bounds to indicate that the improvements in sample complexity achieved by our algorithms are information-theoretically optimal. Furthermore, experimental results show that the performance of our algorithms on real data significantly exceeds our worst-case guarantees for sample complexity, demonstrating the practicality of our approach.
Machine Learning, Data Structures and Algorithms
What problem does this paper attempt to address?
This paper asks how a predicted distribution can reduce the number of samples needed for hypothesis testing of discrete distributions. Specifically, it studies uniformity testing, identity testing (also known as goodness-of-fit testing), and closeness testing (also known as two-sample testing) when a prediction of the underlying distribution is available. The prediction may be derived from historical data or from predictive machine learning models.

### Main Contributions

1. **Reduction of Sample Complexity**: The paper shows that a predicted distribution can reduce the number of samples required for all three property testing tasks. The reduction depends directly on the quality of the prediction, measured by the total variation distance (TVD) between the predicted distribution and the actual distribution.
2. **Algorithm Adaptability**: The proposed algorithms self-adjust their sample complexity to the accuracy of the prediction, without prior knowledge of that accuracy; in this sense they are consistent. Even when the prediction carries no meaningful information, the algorithms never use more samples than the standard approaches, so they are also robust.
3. **Theoretical Optimality**: The paper provides lower bounds indicating that the improvements in sample complexity achieved by the proposed algorithms are information-theoretically optimal.
4. **Experimental Verification**: Experimental results show that the performance of the proposed algorithms on real data significantly exceeds the worst-case sample complexity guarantees, demonstrating the practicality of the approach.
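The total variation distance named in the first contribution has a simple closed form for discrete distributions over a shared finite domain: half the L1 distance between the two probability vectors. The following is a minimal sketch of that formula, not code from the paper; the function name `total_variation_distance` and the list-based representation of distributions are assumptions made for illustration.

```python
import math


def total_variation_distance(p, q):
    """Return the TVD between two discrete distributions.

    Both arguments are sequences of probabilities over the same
    finite domain (hypothetical representation, chosen for clarity).
    TVD(p, q) = 0.5 * sum_i |p_i - q_i|.
    """
    if len(p) != len(q):
        raise ValueError("distributions must share the same domain size")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Example: a prediction that puts all mass on half of a 4-element domain,
# compared against the uniform distribution on that domain.
predicted = [0.5, 0.5, 0.0, 0.0]
uniform = [0.25, 0.25, 0.25, 0.25]
distance = total_variation_distance(predicted, uniform)
```

A value of 0 means the prediction matches the distribution exactly, and a value of 1 means the two have disjoint supports.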
### Technical Details

- **Prediction Quality Metric**: The accuracy of the prediction is measured by the total variation distance (TVD) between the predicted distribution \(\hat{p}\) and the unknown distribution \(p\).
- **Search and Test**: The algorithm has two parts, search and test. The search part guesses \(\|p-\hat{p}\|_{\text{TV}}\), and the test part runs the actual distribution test using this guessed accuracy level.
- **Sample Complexity**:
  - For uniformity testing and identity testing, when \(d\leq\alpha\), the required number of samples is \(\Theta\left(\frac{\sqrt{n}}{\epsilon^{2}}\right)\); when \(d > \alpha\), the required number of samples is \(\Theta\left(\min\left(\frac{1}{(d-\alpha)^{2}}, \frac{\sqrt{n}}{\epsilon^{2}}\right)\right)\), where \(d = \|q-\hat{p}\|_{\text{TV}}\). Here \(n\) is the domain size, \(\epsilon\) is the distance parameter of the test, \(q\) is the known reference distribution being tested against, and \(\alpha\) is the guessed accuracy level of the prediction.
  - For closeness testing, the required number of samples is \(\Theta\left(\frac{n^{2/3}\alpha^{1/3}}{\epsilon^{4/3}}+\frac{\sqrt{n}}{\epsilon^{2}}\right)\).

### Experimental Results

- **Synthetic and Real Data**: The experiments show that the proposed algorithms can significantly reduce sample complexity on both synthetic and real data, especially when the predicted distribution is close to the actual distribution. For example, on network traffic data, the sample complexity is reduced by up to 40%.

### Related Work

- **Perfect Predictor**: Early research assumed a perfect predictor (i.e., \(\hat{p}=p\)), an assumption that often does not hold in practice.
- **Approximate Predictor**: Some works assume that the predicted probability of each element is within a \((1\pm\epsilon)\) factor of the true value, which is still too strict.
- **Support Estimation**: Other research focuses on the support estimation problem and allows each predicted probability to be within a constant factor of the true value, but these methods usually apply only to specific problems.
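To get a feel for the sample complexity bounds listed under Technical Details, the expressions can be evaluated numerically. The sketch below is illustrative only: it sets every constant hidden by the \(\Theta(\cdot)\) notation to 1, which is an assumption, so the outputs indicate orders of magnitude rather than actual sample counts. The function names are hypothetical and do not come from the paper.

```python
import math


def identity_testing_bound(n, eps, d, alpha):
    """Order-of-magnitude bound for augmented uniformity/identity testing.

    n: domain size, eps: distance parameter,
    d: TVD between the reference distribution q and the prediction,
    alpha: guessed accuracy level of the prediction.
    Hidden constants are taken to be 1 (assumption).
    """
    standard = math.sqrt(n) / eps**2  # bound without any prediction
    if d <= alpha:
        return standard
    return min(1.0 / (d - alpha) ** 2, standard)


def closeness_testing_bound(n, eps, alpha):
    """Order-of-magnitude bound for augmented closeness testing.

    Hidden constants are taken to be 1 (assumption).
    """
    return n ** (2 / 3) * alpha ** (1 / 3) / eps ** (4 / 3) + math.sqrt(n) / eps**2


# Example: n = 10_000 and eps = 0.1, with the prediction far from q.
with_prediction = identity_testing_bound(10_000, 0.1, d=0.5, alpha=0.1)
without_help = identity_testing_bound(10_000, 0.1, d=0.05, alpha=0.1)
closeness = closeness_testing_bound(10_000, 0.1, alpha=0.01)
```

In the first case the \(1/(d-\alpha)^2\) term is far smaller than \(\sqrt{n}/\epsilon^2\), which is where the prediction helps; in the second case \(d\leq\alpha\) and the bound falls back to the standard rate.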
In conclusion, this paper...