Towards Robust Federated Analytics via Differentially Private Measurements of Statistical Heterogeneity

Mary Scott, Graham Cormode, Carsten Maple
2024-11-07
Abstract: Statistical heterogeneity is a measure of how skewed the samples of a dataset are. It is a common problem in the study of differential privacy that the usage of a statistically heterogeneous dataset results in a significant loss of accuracy. In federated scenarios, statistical heterogeneity is more likely to happen, and so the above problem is even more pressing. We explore the three most promising ways to measure statistical heterogeneity and give formulae for their accuracy, while simultaneously incorporating differential privacy. We find the optimum privacy parameters via an analytic mechanism, which incorporates root finding methods. We validate the main theorems and related hypotheses experimentally, and test the robustness of the analytic mechanism to different heterogeneity levels. The analytic mechanism in a distributed setting delivers superior accuracy to all combinations involving the classic mechanism and/or the centralized setting. All measures of statistical heterogeneity do not lose significant accuracy when a heterogeneous sample is used.
Machine Learning, Databases
What problem does this paper attempt to address?
This paper addresses the loss of accuracy that statistical heterogeneity (SH) causes in Federated Analytics (FA), particularly when differential privacy is applied. Specifically:

1. **Challenges of statistical heterogeneity**:
   - Statistical heterogeneity refers to how skewed the samples of a dataset are, that is, the extent to which the differences between individual sub-samples and the overall sample exceed random error.
   - In Federated Learning (FL) and Federated Analytics (FA), the data distribution of each client may differ, so heterogeneity is more common and leads to a significant decrease in accuracy.
2. **Requirements for privacy protection**:
   - To protect user privacy, FL and FA are usually combined with Differential Privacy (DP), but existing methods do not handle statistical heterogeneity well, especially when the data are not independent and identically distributed (non-i.i.d.).
3. **Research objectives**:
   - The paper explores methods for measuring statistical heterogeneity and combines them with differential privacy, so that the accuracy lost to heterogeneity is kept as small as possible while privacy is protected.
   - Specifically, the paper examines three promising measures of statistical heterogeneity, gives formulas for their accuracy, and tunes the privacy parameters to improve robustness and accuracy.

### Main contributions

- **Measuring statistical heterogeneity**: Three measures of statistical heterogeneity are examined, and formulas for their accuracy are given.
- **Combining differential privacy**: Differential privacy is applied to these measures so that accuracy stays high while privacy is protected.
- **Optimizing privacy parameters**: The optimal privacy parameters are found via an analytic mechanism that relies on root-finding methods.
- **Experimental verification**: The main theorems and related hypotheses are validated experimentally, and the robustness of the analytic mechanism is tested at different heterogeneity levels.

### Examples of mathematical formulas

- **Weighted average vector**:
  \[ \vec{\bar{\mu}}=\frac{\sum_{i=1}^{n}w_i\vec{x}_i}{\sum_{i=1}^{n}w_i} \]
  where \(w_i=\frac{1}{s_i^2}\), and \(s_i^2\) is the internal variance of the vector \(\vec{x}_i\).
- **Q-statistic**:
  \[ Q(\vec{X})=\frac{1}{n}\sum_{i=1}^{n}w_i(\vec{x}_i-\vec{\bar{\mu}})^2 \]
- **I²-statistic**:
  \[ I^2(\vec{X})=\max\left(0,\,1-\frac{n-1}{Q(\vec{X})}\right) \]
  If \(Q < n-1\), then \(Q\) is set to \(n-1\), which ensures that \(I^2\in[0,1]\).

With these measures, the paper aims to address the dual challenges of statistical heterogeneity and privacy protection in Federated Analytics.
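
The weighted average, Q-statistic and I²-statistic above can be sketched in a few lines of NumPy. This is a minimal, non-private illustration and not the paper's implementation. It rests on several assumptions: each row of `X` is one vector \(\vec{x}_i\), the "internal variance" \(s_i^2\) is taken to be the sample variance of that row's components, and the squared vector difference is read as a squared Euclidean norm. The function names are hypothetical.

```python
import numpy as np


def inverse_variance_weights(X):
    """w_i = 1 / s_i^2, with s_i^2 assumed to be the sample variance
    of the components of row i (an interpretation, not from the paper)."""
    return 1.0 / np.var(X, axis=1, ddof=1)


def weighted_mean(X, w):
    """Weighted average vector: sum_i w_i x_i / sum_i w_i."""
    return (w[:, None] * X).sum(axis=0) / w.sum()


def q_statistic(X, w):
    """Q as written above, including the 1/n factor."""
    mu = weighted_mean(X, w)
    # Squared vector difference read as squared Euclidean norm (assumption).
    sq_dist = ((X - mu) ** 2).sum(axis=1)
    return (w * sq_dist).sum() / len(X)


def i2_statistic(X, w):
    """I^2 = 1 - (n - 1) / Q, with Q raised to n - 1 when it is smaller."""
    n = len(X)
    q = max(q_statistic(X, w), n - 1)
    return 1.0 - (n - 1) / q


# Three similar vectors versus three widely separated ones.
similar = np.array([[1.0, 3.0], [2.0, 4.0], [3.0, 5.0]])
spread = np.array([[0.0, 2.0], [10.0, 12.0], [20.0, 22.0]])

i2_similar = i2_statistic(similar, inverse_variance_weights(similar))
i2_spread = i2_statistic(spread, inverse_variance_weights(spread))
print(i2_similar, i2_spread)
```

In this toy example the similar vectors give an I² of 0 and the widely separated vectors give a value close to 1, matching the reading of I² as the share of variation attributable to heterogeneity. Note that the classical Cochran's Q from meta-analysis has no \(1/n\) factor; the sketch keeps the factor because it follows the formula as stated above. A row whose components are all equal has zero variance and would produce an infinite weight, which a real implementation would need to handle.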