Abstract: Annotating data for sensitive labels (e.g., disease, smoking) poses a potential threat to individual privacy in many real-world scenarios. To cope with this problem, we propose a novel setting to protect the privacy of each instance, namely learning from concealed labels for multi-class classification. Concealed labels prevent sensitive labels from appearing in the label set during the label collection stage: a "none" label and some randomly sampled insensitive labels are specified as the concealed label set used to annotate sensitive data. We show that an unbiased estimator can be established from concealed data under mild assumptions, and that the learned multi-class classifier can not only classify instances from insensitive labels accurately but also recognize instances from the sensitive labels. Moreover, we bound the estimation error and show that the multi-class classifier achieves the optimal parametric convergence rate. Experiments demonstrate the significance and effectiveness of the proposed method for concealed labels on synthetic and real-world datasets.
What problem does this paper attempt to address?
This paper addresses how to protect the privacy of sensitive labels (such as disease or smoking status) during data labeling while still training machine-learning models effectively. Specifically, it proposes a new weakly-supervised learning framework, learning from Concealed Labels, which prevents sensitive labels from appearing in the label set and thereby protects individual privacy.
### Problem Background
In many real-world scenarios, it is very difficult to obtain large-scale datasets with accurate supervision information, especially when sensitive information is involved. For example, in medical and health classification tasks, patients may be unwilling to disclose sensitive information (such as disease history or smoking habits), so these labels are hard to collect directly. Traditional methods usually hide all labels entirely to protect privacy, but the resulting lack of precisely labeled data makes classifiers harder to train.
### The Solution Proposed in the Paper
To solve this problem, the paper proposes a new weakly-supervised learning setting, Learning from Concealed Labels. In this setting, sensitive labels never appear in the label set; instead, data are annotated using a "none" label together with some randomly sampled non-sensitive labels. This protects sensitive information while still allowing the model to train on some non-sensitive labels.
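The annotation step described above can be simulated in a few lines. This is only a sketch under one reading of the setting: each instance is shown `num_shown` labels drawn uniformly without replacement from the insensitive labels, plus a "none" option, and the annotator picks the true label if it is shown and "none" otherwise. The names `NONE`, `conceal_label`, and `none_rate` are hypothetical and do not come from the paper.

```python
import random

NONE = "none"  # hypothetical placeholder for the "none" label


def conceal_label(true_label, insensitive_labels, num_shown, rng):
    """Return the concealed label an annotator would pick (assumed protocol).

    A random subset of the insensitive labels is shown together with NONE.
    Sensitive labels are never in that subset, so they always map to NONE.
    """
    shown = rng.sample(insensitive_labels, num_shown)
    return true_label if true_label in shown else NONE


def none_rate(true_label, insensitive_labels, num_shown, trials=20000, seed=0):
    """Estimate how often a given true label ends up annotated as NONE."""
    rng = random.Random(seed)
    hits = sum(
        conceal_label(true_label, insensitive_labels, num_shown, rng) == NONE
        for _ in range(trials)
    )
    return hits / trials
```

With K insensitive labels of which L are shown, an insensitive true label is missing from the shown subset with probability (K - L)/K under this protocol, which matches the constant "none" probability stated in the conditional distribution assumption under Formulas and Theoretical Analysis. A sensitive true label is annotated as "none" every time.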
### Main Contributions
1. **A new privacy-preserving weakly-supervised learning setting**: The paper introduces learning from Concealed Labels, which prevents sensitive labels from appearing in the label set.
2. **An unbiased risk estimator**: The paper proposes an empirical risk minimization method, constructs an unbiased risk estimator from Concealed Labels data, and bounds its estimation error.
3. **Experimental verification**: Experiments on multiple benchmark datasets and two real-world Concealed Labels datasets demonstrate the effectiveness and superiority of the proposed method.
### Formulas and Theoretical Analysis
The key formulas and theoretical analysis behind the method are as follows:
- **Conditional Distribution Assumption**:
\[
P(s = s_{\text{none}}|x, y = c_l)=1
\]
\[
P(s = j \mid x, y = i)=0 \quad \text{for } j\notin\{i, s_{\text{none}}\}
\]
\[
P(s = s_{\text{none}}|x, y = 1)=P(s = s_{\text{none}}|x, y = 2)=\cdots = P(s = s_{\text{none}}|x, y = K)=\frac{K - L}{K}
\]
- **Unbiased Risk Estimator**:
\[
R_{CL}(f)=\mathbb{E}_{(x, s)\sim P(x, s\neq s_{\text{none}})}\left[\frac{K}{L}L(f(x), s)\right]+\mathbb{E}_{(x, s)\sim P(x, s = s_{\text{none}})}\left[\frac{K}{L}L(f(x), c_l)\right]-\mathbb{E}_M\left[\frac{K - L}{L}L(f(x), c_l)\right]
\]
- **Modified Risk Estimator**:
\[
\hat{R}_{gCL}(f)=\frac{1}{\#X_s}\sum_{s = 1}^K\sum_{x_j\in X_s}\frac{K}{L}L(f(x_j), s)+g\left(\frac{1}{\#X_{\text{none}}}\sum_{x_j\in X_{\text{none}}}\frac{K}{L}L(f(x_j), c_l)-\frac{1}{\#X_c}\sum_{x_j\in X_c}\frac{K - L}{L}L(f(x_j), c_l)\right)
\]
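To make the structure of the estimator concrete, here is a minimal sketch of how its three terms could be combined once per-sample losses are available. It mirrors the weights K/L, K/L, and (K - L)/L from R_CL above, with the correction function g wrapped around the difference of the last two terms. Several things here are assumptions rather than the paper's definitions: the function name `corrected_risk`, the choice g(v) = max(v, 0) (this summary does not define g), the use of plain sample means for each term, and the third sample set, which is simply whatever set the third expectation ranges over.

```python
from statistics import fmean


def corrected_risk(loss_labeled, loss_none_cl, loss_ref_cl, K, L,
                   g=lambda v: max(v, 0.0)):
    """Combine per-sample losses into a corrected empirical risk (sketch).

    loss_labeled: losses L(f(x), s) for samples annotated with a label s != none
    loss_none_cl: losses L(f(x), c_l) for samples annotated as "none"
    loss_ref_cl:  losses L(f(x), c_l) for the samples used in the third term
    g:            correction function; max(v, 0) is an assumed choice
    """
    term_labeled = (K / L) * fmean(loss_labeled)
    term_none = (K / L) * fmean(loss_none_cl)
    term_ref = ((K - L) / L) * fmean(loss_ref_cl)
    # The correction is applied only to the difference of the last two terms.
    return term_labeled + g(term_none - term_ref)
```

Passing `g=lambda v: v` removes the correction and gives the plain empirical counterpart of R_CL under the same assumptions, which is useful for comparing the two on a small batch.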