On PAC Learning Halfspaces in Non-interactive Local Privacy Model with Public Unlabeled Data

Jinyan Su, Jinhui Xu, Di Wang
DOI: https://doi.org/10.48550/arXiv.2209.08319
2022-09-17
Abstract: In this paper, we study the problem of PAC learning halfspaces in the non-interactive local differential privacy model (NLDP). To breach the barrier of exponential sample complexity, previous results studied a relaxed setting where the server has access to some additional public but unlabeled data. We continue in this direction. Specifically, we consider the problem under the standard setting instead of the large margin setting studied before. Under different mild assumptions on the underlying data distribution, we propose two approaches that are based on the Massart noise model and self-supervised learning and show that it is possible to achieve sample complexities that are only linear in the dimension and polynomial in other terms for both private and public data, which significantly improve the previous results. Our methods could also be used for other private PAC learning problems.
Machine Learning
What problem does this paper attempt to address?
### The problems the paper attempts to solve

This paper studies how to PAC learn halfspaces in the non-interactive local differential privacy (NLDP) model when the server also has access to public unlabeled data. Specifically, the authors aim to overcome the barrier of exponential sample complexity and to obtain sample complexities that significantly improve on previous results.

#### Background and motivation

1. **Privacy protection requirements**: As sensitive data is generated and collected at large scale, analyzing it without exposing individuals' private information has become an important problem. Differential privacy (DP) has become the de facto tool for this purpose.
2. **Existing challenges**: In the NLDP model, the restriction on the number of communication rounds makes learning harder to analyze than in other models. In particular, Daniely and Feldman (2019) proved that learning halfspaces requires exponential sample complexity even under the large-margin assumption. To get around this, they introduced a relaxed NLDP model in which the server can access some public but unlabeled data.
3. **Improvement goals**: This paper aims to further reduce the sample complexity in the standard setting (rather than the large-margin setting) when public unlabeled data is available, so that the sample complexity is linear in the dimension and polynomial in the other terms.

#### Main contributions

1. **Anti-anti-concentration property**: The authors first consider data distributions that satisfy the anti-concentration and anti-anti-concentration properties, and propose an (ε, δ)-NLDP algorithm based on the Massart noise model whose sample complexity is linear in the dimension.
2. **Self-supervised learning**: To further reduce the sample complexity of the public data, the authors study a self-supervised learning method for mixed distributions and propose an algorithm that achieves \(O(d/\alpha^2)\) sample complexity for the public data.

### Formula summary

- **Sample complexity**:
  - Private data: \(\tilde{O}(d\cdot\text{Poly}(1/\epsilon, 1/\alpha))\)
  - Public unlabeled data: \(O(d/\alpha^4)\) or \(O(d/\alpha^2)\)
- **Massart noise model**:
  - Each sample's label is flipped with probability at most \(\lambda < 1/2\), that is:
    \[ y = \begin{cases} f(x), & \text{with probability } 1 - \lambda(x)\\ - f(x), & \text{with probability } \lambda(x) \end{cases} \]
    where \(\lambda(x)\leq\lambda\).
- **Anti-anti-concentration property**:
  - For any 2-dimensional subspace \(V\), the probability density function \(\gamma_V\) of the data distribution projected onto \(V\) satisfies the upper bound (anti-concentration):
    \[ \gamma_V(x)\leq U\quad\forall x\in V \]
    and, for all points with \(\|x\|_2\leq r\), the lower bound (anti-anti-concentration):
    \[ \gamma_V(x)\geq\frac{1}{U} \]

With these results, the paper substantially lowers the sample complexity of PAC learning halfspaces while preserving privacy in the NLDP model.
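To make the Massart noise model in the formula summary concrete, here is a minimal simulation sketch. It is not the paper's algorithm and involves no privacy mechanism; it only generates labels \(y\) from a halfspace \(f(x) = \text{sign}(\langle w, x\rangle)\) and flips each label with a point-dependent probability \(\lambda(x)\leq\lambda\). The Gaussian data, the particular choice \(\lambda(x) = \lambda/(1 + |\langle w, x\rangle|)\), and all names (`massart_labels`, `lam_max`, and so on) are assumptions made for illustration.

```python
import numpy as np


def massart_labels(X, w, lam_max, rng):
    """Label points with a halfspace, then apply Massart noise.

    Returns the noisy labels, the clean labels, and the per-point
    flip probabilities lambda(x), all bounded by lam_max < 1/2.
    """
    if not 0.0 <= lam_max < 0.5:
        raise ValueError("Massart noise requires 0 <= lam_max < 1/2")

    margin = X @ w
    clean = np.where(margin >= 0, 1, -1)  # f(x) = sign(<w, x>)

    # Hypothetical lambda(x): largest near the decision boundary,
    # never above lam_max. Any function bounded by lam_max is allowed.
    lam_x = lam_max / (1.0 + np.abs(margin))

    # Flip each label independently with probability lambda(x).
    flips = rng.random(X.shape[0]) < lam_x
    noisy = np.where(flips, -clean, clean)
    return noisy, clean, lam_x


rng = np.random.default_rng(0)
d, n, lam_max = 5, 20000, 0.3
w = rng.standard_normal(d)       # illustrative target halfspace
X = rng.standard_normal((n, d))  # illustrative data distribution

noisy, clean, lam_x = massart_labels(X, w, lam_max, rng)
print("largest lambda(x):", lam_x.max())
print("empirical flip rate:", np.mean(noisy != clean))
```

Because every \(\lambda(x)\) is capped at \(\lambda\), the empirical flip rate stays below `lam_max`. Unlike random classification noise, the flip probability may vary from point to point, which is what separates the Massart model from a uniform noise rate.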