Abstract: Label differential privacy (DP) is a framework that protects the privacy of labels in training datasets, while the feature vectors are public. Existing approaches protect the privacy of labels by flipping them randomly, and then train a model so that its output approximates the privatized label. However, as the number of classes $K$ increases, stronger randomization is needed, so the performance of these methods degrades significantly. In this paper, we propose a vector approximation approach, which is easy to implement and introduces little additional computational overhead. Instead of flipping each label into a single scalar, our method converts each label into a random vector with $K$ components, whose expectations reflect class conditional probabilities. Intuitively, vector approximation retains more information than scalar labels. A brief theoretical analysis shows that the performance of our method decays only slightly with $K$. Finally, we conduct experiments on both synthesized and real datasets, which validate our theoretical analysis as well as the practical performance of our method.
What problem does this paper attempt to address?
The paper addresses the sharp drop in performance that existing Label Differential Privacy (Label DP) methods suffer when the number of classes \(K\) is large. Specifically:
1. **Background and Problem Description**:
- In supervised learning, label differential privacy aims to protect the privacy of labels in the training dataset, while the feature vectors are public.
- Existing methods protect privacy by randomly flipping labels, but as the number of classes \(K\) increases, stronger randomization is required, resulting in a significant drop in model performance.
2. **Limitations of Existing Methods**:
- Existing methods such as Randomized Response, RRWithPrior, and ALIBI protect privacy by converting labels into a single scalar. However, in the multi-class case, the performance of these methods drops sharply as \(K\) increases.
- From an information-theoretic perspective, a single scalar can only convey limited information. Therefore, as \(K\) increases, it becomes increasingly difficult to maintain the statistical dependence between the original labels and the privatized labels, resulting in a drop in model performance.
3. **Method Proposed in the Paper**:
- The authors propose a label differential privacy method based on vector approximation. Specifically, each label is converted into a random vector \(Z=(Z(1),\dots,Z(K))\in \{0, 1\}^K\), where the expectation of \(Z(j)\) reflects the conditional class probability.
- This method retains more information, especially when \(K\) is large, and thus can achieve better performance.
4. **Theoretical Analysis and Experimental Verification**:
- The paper provides a brief theoretical analysis, indicating that the performance of this method will only decline slightly as \(K\) increases.
- The experimental results on synthetic data and standard benchmark datasets verify the validity of the theoretical analysis, showing that this method is significantly superior to existing methods when \(K\) is large.
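To make the contrast between scalar and vector privatization concrete, here is a minimal Python sketch. The scalar baseline is standard \(K\)-ary randomized response. The vector mechanism is an assumption: the summary does not state the exact flipping probabilities, so the sketch uses one common construction, independent randomized response on each bit of the one-hot encoding with budget \(\epsilon/2\) per bit. The function names (`randomized_response`, `vector_privatize`, `debias`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def randomized_response(y, K, eps, rng):
    """Scalar baseline: keep y with prob e^eps / (e^eps + K - 1),
    otherwise return one of the other K - 1 labels uniformly."""
    keep_prob = np.exp(eps) / (np.exp(eps) + K - 1)
    if rng.random() < keep_prob:
        return int(y)
    other = int(rng.integers(K - 1))  # uniform over the K - 1 other labels
    return other if other < y else other + 1


def vector_privatize(y, K, eps, rng):
    """Assumed vector mechanism (not taken from the source): flip each bit of
    one-hot(y) independently with probability 1 / (1 + e^(eps/2)).
    Changing y alters the distribution of only two bits, each by a factor of
    at most e^(eps/2), so the whole vector satisfies eps-label DP."""
    one_hot = np.zeros(K, dtype=np.int64)
    one_hot[y] = 1
    flip_prob = 1.0 / (1.0 + np.exp(eps / 2.0))
    flips = rng.random(K) < flip_prob
    return np.where(flips, 1 - one_hot, one_hot)


def debias(z_mean, eps):
    """Invert E[Z(j)] = q + (p - q) * P(Y = j) to estimate class probabilities,
    where p = P(Z(j)=1 | Y=j) and q = P(Z(j)=1 | Y!=j)."""
    p = np.exp(eps / 2.0) / (1.0 + np.exp(eps / 2.0))
    q = 1.0 - p
    return (np.asarray(z_mean, dtype=float) - q) / (p - q)


# Usage: privatize a small batch of labels and recover class frequencies.
rng = np.random.default_rng(0)
K, eps = 10, 1.0
labels = rng.integers(K, size=2000)
Z = np.stack([vector_privatize(int(y), K, eps, rng) for y in labels])
estimated_freqs = debias(Z.mean(axis=0), eps)
```

Under this assumed construction, the expectation of each component \(Z(j)\) is an affine function of the class probability, which is why a model trained to match the vector can be de-biased afterwards. The scalar baseline, by contrast, returns the true label with a probability that shrinks as \(K\) grows.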
In summary, the main contribution of this paper is a new label differential privacy method based on vector approximation. It addresses the significant performance degradation of existing methods in multi-class classification tasks, and its effectiveness is supported both theoretically and experimentally.
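The degradation of scalar methods can be made concrete for plain \(K\)-ary randomized response, the simplest of the baselines named above. The notation \(\tilde{Y}\) for the privatized label and \(\epsilon\) for the privacy budget is introduced here for illustration and does not appear in the summary itself. Under \(\epsilon\)-label DP, the standard mechanism outputs

\[
\Pr[\tilde{Y} = y \mid Y = y] = \frac{e^{\epsilon}}{e^{\epsilon} + K - 1},
\qquad
\Pr[\tilde{Y} = y' \mid Y = y] = \frac{1}{e^{\epsilon} + K - 1} \quad (y' \neq y).
\]

For a fixed \(\epsilon\), the probability of keeping the true label tends to zero as \(K\) grows. With \(\epsilon = 1\), for example, it is about \(0.232\) for \(K = 10\) and about \(0.027\) for \(K = 100\), so almost every privatized label is wrong when the number of classes is large.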