Optimal Private and Communication Constraint Distributed Goodness-of-Fit Testing for Discrete Distributions in the Large Sample Regime

Lasse Vuursteen
2024-11-02
Abstract: We study distributed goodness-of-fit testing for discrete distributions under bandwidth and differential privacy constraints. Information-constrained distributed goodness-of-fit testing is a problem that has received considerable attention recently. The important case of discrete distributions is theoretically well understood in the classical case where all data is available in one "central" location. In a federated setting, however, data is distributed across multiple "locations" (e.g. servers) and cannot readily be shared due to e.g. bandwidth or privacy constraints that each server needs to satisfy. We show how recently derived results for goodness-of-fit testing for the mean of a multivariate Gaussian model extend to discrete distributions, by leveraging Le Cam's theory of statistical equivalence. In doing so, we derive matching minimax upper and lower bounds for goodness-of-fit testing for discrete distributions under bandwidth or privacy constraints in the regime where the number of samples held locally is large.
Statistics Theory
What problem does this paper attempt to address?
The paper attempts to address the problem of performing goodness-of-fit testing for discrete distributions in a distributed environment, while considering bandwidth limitations and differential privacy constraints. Specifically, the researchers focus on how to effectively conduct goodness-of-fit testing for discrete distributions under bandwidth and privacy constraints when each server holds a large number of samples.

### Background and Motivation

1. **Distributed Environment**: In distributed scenarios such as federated learning, data is distributed across multiple locations (e.g., servers) and cannot be easily shared centrally because each server needs to meet bandwidth or privacy constraints.
2. **Existing Research**: Classical goodness-of-fit testing is well understood when all data is centralized in one location. However, in a distributed environment, data is dispersed and cannot be easily shared, presenting new challenges.
3. **Application Areas**: This problem has important applications in various fields, such as population genetics, information retrieval, speech and text classification, text mining, and large language models.

### Research Objectives

1. **Bandwidth Limitation**: Investigate how to perform goodness-of-fit testing for discrete distributions under bandwidth constraints.
2. **Differential Privacy**: Investigate how to perform goodness-of-fit testing for discrete distributions while meeting differential privacy requirements.
3. **Large Sample Scenario**: Pay special attention to the scenario where each server holds a large number of samples, which is common in many practical applications.

### Methods and Contributions

1. **Statistical Equivalence Theory**: Utilize Le Cam's statistical equivalence theory to transform the problem of goodness-of-fit testing for discrete distributions into the problem of mean goodness-of-fit testing for high-dimensional Gaussian models.
2. **Optimal Upper and Lower Bounds**: Derive the minimax upper and lower bounds under bandwidth and differential privacy constraints, which are consistent with the corresponding results for high-dimensional Gaussian models.
3. **Theoretical Analysis**: Through detailed theoretical analysis, demonstrate that in the large sample scenario, the problem of goodness-of-fit testing for discrete distributions is statistically equivalent to the problem of mean goodness-of-fit testing for high-dimensional Gaussian models.

### Main Results

1. **Minimax Rate under Bandwidth Limitation**: Under bandwidth constraints, the minimax rate for goodness-of-fit testing for discrete distributions is consistent with the corresponding results for high-dimensional Gaussian models.
2. **Minimax Rate under Differential Privacy**: Under differential privacy requirements, the minimax rate for goodness-of-fit testing for discrete distributions is also consistent with the corresponding results for high-dimensional Gaussian models.

### Conclusion

By introducing statistical equivalence theory, this paper successfully addresses the problem of performing goodness-of-fit testing for discrete distributions in a distributed environment, particularly when each server holds a large number of samples and needs to meet bandwidth and differential privacy constraints. These results provide an important theoretical foundation for statistical inference in distributed environments.
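
### Illustrative Sketch

To give a concrete feel for the large-sample regime discussed above, the following is a minimal sketch of an unconstrained baseline: each server computes a Pearson chi-square statistic on its own samples, and the statistics are summed. It is not the paper's procedure and imposes neither the bandwidth nor the differential privacy constraint; it only shows that, when the local sample size is large, standardized category counts behave approximately like Gaussian coordinates. The function names (`draw`, `local_statistic`, `pooled_test`), the sample sizes, and the rejection threshold are all assumptions made for illustration.

```python
import math
import random


def draw(rng, p, n):
    """Draw n category indices from the discrete distribution p."""
    return rng.choices(range(len(p)), weights=p, k=n)


def local_statistic(samples, p0):
    """Pearson chi-square statistic of one server's samples against p0."""
    n = len(samples)
    counts = [0] * len(p0)
    for x in samples:
        counts[x] += 1
    # (c - n*p) / sqrt(n*p) is approximately Gaussian under the null for large n,
    # so the sum of squares is approximately chi-square with d - 1 degrees of freedom.
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, p0))


def pooled_test(server_samples, p0, z_threshold=3.0):
    """Sum the local statistics and compare with a normal approximation.

    Under the null, the sum over m independent servers has mean m * (d - 1)
    and variance 2 * m * (d - 1). The threshold is an illustrative choice.
    """
    m = len(server_samples)
    dof = len(p0) - 1
    total = sum(local_statistic(s, p0) for s in server_samples)
    z = (total - m * dof) / math.sqrt(2 * m * dof)
    return z > z_threshold, z


# Hypothetical setup: 20 servers, 500 samples each, 4 categories.
rng = random.Random(0)
p0 = [0.25, 0.25, 0.25, 0.25]   # null distribution
p1 = [0.40, 0.20, 0.20, 0.20]   # an alternative that differs from p0

null_data = [draw(rng, p0, 500) for _ in range(20)]
alt_data = [draw(rng, p1, 500) for _ in range(20)]

print("null:", pooled_test(null_data, p0))
print("alternative:", pooled_test(alt_data, p0))
```

In the constrained settings the paper studies, each server could not send its exact statistic: it would have to compress it to a limited number of bits or randomize it to satisfy differential privacy, which is where the minimax rates depart from this centralized-style baseline.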