Abstract: We consider the problem of mean estimation under user-level local differential privacy, where $n$ users are contributing through their local pool of data samples. Previous work assumes that the number of data samples is the same across users. In contrast, we consider a more general and realistic scenario where each user $u \in [n]$ owns $m_u$ data samples drawn from some generative distribution $\mu$; $m_u$ being unknown to the statistician but drawn from a known distribution $M$ over $\mathbb{N}^\star$. Based on a distribution-aware mean estimation algorithm, we establish an $M$-dependent upper bound on the worst-case risk over $\mu$ for the task of mean estimation. We then derive a lower bound. The two bounds are asymptotically matching up to logarithmic factors and reduce to known bounds when $m_u = m$ for any user $u$.
### What problem does this paper attempt to solve?
This paper aims to solve the problem of mean estimation under the user-level local differential privacy (LDP) framework. Specifically, the authors consider a more realistic scenario where each user has a different number of data samples, rather than the same number of samples for all users as assumed in previous studies.
#### Background and Motivation
Modern machine-learning techniques rely on large training data sets to improve model accuracy. However, as users pay more attention to personal data privacy and regulatory frameworks are strengthened, conducting effective data analysis while protecting user privacy has become an important issue. Differential privacy (DP) is a powerful privacy-protection tool that protects user privacy in a quantifiable manner while maintaining statistical utility.
Traditional LDP-based models usually assume that each user contributes only one data sample, which suits some basic applications, such as estimating the average salary of a group of individuals. In many practical applications, however, users may contribute multiple data points. For example, users can provide local databases containing multiple product or movie ratings for training recommendation systems, or multiple word sequences typed on a mobile phone keyboard for training a next-word prediction model.
#### Research Status and Challenges
Most existing research on user-level differential privacy assumes that every user provides the same number of samples, an assumption that does not hold in many practical scenarios. For example, in federated learning applications, the number of samples often varies greatly from one user to another. Effective mean estimation with heterogeneous per-user sample counts is therefore an open problem that this paper addresses.
#### Main Contributions of the Paper
1. **Propose a new framework**: This paper proposes a more realistic user-level LDP framework that allows users to contribute through local data sets of different sizes.
2. **Design a new algorithm**: For the one-dimensional mean estimation task, an algorithm called Distribution-Aware Mean Estimation (DAME) is proposed. The DAME algorithm is divided into two stages: the localization stage and the estimation stage.
   - **Localization stage**: Using the randomized response mechanism, the private data of the first half of the users is discretized and privatized to identify the candidate interval containing the true mean.
   - **Estimation stage**: The private data of the remaining users is projected, and noise is added through the Laplace mechanism to ensure privacy.
3. **Theoretical analysis**: Non-asymptotic upper and lower bounds on the worst-case risk are derived, and the DAME algorithm is proved to be optimal in many cases.
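
The localization stage described in item 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: data lie in $[0, 1]$, bins are equal-width intervals of width `tau`, each bit of the indicator vector is flipped independently by randomized response, and the per-bit budget is `eps / 6` (two indicator vectors with at most three ones each differ in at most six coordinates). The function name `localize` and its parameters are hypothetical.

```python
import math

import numpy as np


def localize(user_samples, tau, eps, rng):
    """Pick a candidate bin (0-based index) from privatized neighbourhood votes.

    Assumptions (not taken from the paper): samples lie in [0, 1], bins have
    equal width tau, and each bit is privatized with budget eps / 6.
    """
    num_bins = math.ceil(1.0 / tau)
    # Probability of reporting a bit truthfully under randomized response.
    p_keep = math.exp(eps / 6.0) / (1.0 + math.exp(eps / 6.0))
    votes = np.zeros(num_bins)
    for samples in user_samples:
        mean = float(np.mean(samples))          # local sample mean of one user
        k = min(int(mean / tau), num_bins - 1)  # bin containing that mean
        v = np.zeros(num_bins, dtype=bool)
        # Bit j is 1 when the mean falls in bin j-1, j or j+1.
        v[max(k - 1, 0):min(k + 2, num_bins)] = True
        # Flip each bit independently with probability 1 - p_keep.
        flips = rng.random(num_bins) >= p_keep
        votes += np.logical_xor(v, flips)
    # The bin with the most privatized votes is the candidate interval.
    return int(np.argmax(votes))


# Usage: 200 users with different sample counts, true mean 0.52.
rng = np.random.default_rng(0)
data = [rng.uniform(0.42, 0.62, size=int(rng.integers(1, 50))) for _ in range(200)]
j_hat = localize(data, tau=0.1, eps=2.0, rng=rng)
```

Voting for a three-bin neighbourhood rather than a single bin makes the selection robust when the true mean sits close to a bin boundary, since users whose sample means land on either side of the boundary still vote for a common bin.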
#### Summary of Mathematical Formulas
- **Sample mean**:
\[
\bar{X}^{(u)}_{m_u}=\frac{1}{m_u}\sum_{t = 1}^{m_u}X^{(u)}_t
\]
- **Indicator vector**:
\[
V^{(u)}_j=\mathbb{1}\left\{\bar{X}^{(u)}_{m_u}\in\bigcup_{k\in\{j - 1,j,j + 1\}}I_k\right\}
\]
- **Candidate interval selection**:
\[
\hat{j}=\arg\max_{j\in\left[\left\lceil\frac{1}{\tau}\right\rceil\right]}\sum_{u = 1}^{n/2}\widetilde{V}^{(u)}_j
\]
- **Local mean estimation in the estimation stage**:
\[
\hat{X}^{(u)}_{\hat{j}}=\sqrt{\frac{m_u\wedge\tilde{m}}{\tilde{m}}}\bar{X}^{(u)}_{m_u}+\sqrt{\frac{\tilde{m}}{m_u\wedge\tilde{m}-1}}s_{\hat{j}}