Abstract: Annotating data with sensitive labels (e.g., disease, smoking) poses a potential threat to individual privacy in many real-world scenarios. To cope with this problem, we propose a novel setting that protects the privacy of each instance, namely learning from concealed labels for multi-class classification. Concealed labels prevent sensitive labels from appearing in the label set during the label collection stage: the concealed label set used to annotate sensitive data consists of the "none" label and some randomly sampled insensitive labels. In this paper, we show that an unbiased estimator can be established from concealed data under mild assumptions, and that the learned multi-class classifier can not only classify instances with insensitive labels accurately but also recognize instances with sensitive labels. Moreover, we bound the estimation error and show that the multi-class classifier achieves the optimal parametric convergence rate. Experiments on synthetic and real-world datasets demonstrate the significance and effectiveness of the proposed method for concealed labels.

What problem does this paper attempt to address?

This paper addresses how to protect the privacy of sensitive labels (such as diseases or smoking) during data labeling while still allowing machine-learning models to be trained effectively. Specifically, the authors propose a new weakly supervised learning framework, learning from concealed labels, which prevents sensitive labels from appearing in the labeled set and thereby protects individual privacy.

### Problem Background

In many real-world scenarios it is difficult to obtain large-scale datasets with accurate supervision, especially when sensitive information is involved. In medical and health classification tasks, for example, patients may be unwilling to disclose sensitive information such as disease history or smoking habits, so these labels are hard to collect directly. Traditional methods usually hide all labels to protect privacy, but this makes classifiers harder to train because precisely labeled data are lacking.

### The Solution Proposed in the Paper

The authors propose a new weakly supervised learning setting, learning from concealed labels. In this setting sensitive labels never appear in the labeled set; instead, data are annotated with a "none" label and some randomly sampled non-sensitive labels. This protects sensitive information while still letting the model train on some non-sensitive labels.

### Main Contributions

1. **Propose a new privacy-preserving weakly supervised learning setting**: learning from concealed labels, which prevents sensitive labels from appearing in the labeled set.
2. **Construct an unbiased risk estimator**: The authors propose an empirical risk minimization method, construct an unbiased estimator from concealed-label data, and bound its estimation error.
3. **Experimental verification**: Experiments on multiple benchmark datasets and two real-world concealed-label datasets demonstrate the effectiveness of the proposed method and its advantage over the compared methods.

### Formulas and Theoretical Analysis

The key formulas behind the method are listed below. The loss is written $\mathcal{L}$ to distinguish it from the constant $L$ that appears in the weights $K/L$ and $(K-L)/L$.

- **Conditional Distribution Assumption**:

\[ P(s = s_{\text{none}} \mid x, y = c_l) = 1 \]

\[ P(s = j \mid x, y = i) = 0, \quad j \notin \{i, s_{\text{none}}\} \]

\[ P(s = s_{\text{none}} \mid x, y = 1) = P(s = s_{\text{none}} \mid x, y = 2) = \cdots = P(s = s_{\text{none}} \mid x, y = K) = \frac{K - L}{K} \]

- **Unbiased Risk Estimator**:

\[ R_{CL}(f) = \mathbb{E}_{(x, s)\sim P(x, s\neq s_{\text{none}})}\left[\frac{K}{L}\mathcal{L}(f(x), s)\right] + \mathbb{E}_{(x, s)\sim P(x, s = s_{\text{none}})}\left[\frac{K}{L}\mathcal{L}(f(x), c_l)\right] - \mathbb{E}_M\left[\frac{K - L}{L}\mathcal{L}(f(x), c_l)\right] \]

- **Modified Risk Estimator**:

\[ \hat{R}_{gCL}(f) = \frac{1}{\#X_s}\sum_{s = 1}^{K}\sum_{x_j\in X_s}\frac{K}{L}\mathcal{L}(f(x_j), s) + g\left(\frac{1}{\#X_{\text{none}}}\sum_{x_j\in X_{\text{none}}}\frac{K}{L}\mathcal{L}(f(x_j), c_l) - \frac{1}{\#X_c}\sum_{x_j\in X_c}\frac{K - L}{L}\mathcal{L}(f(x_j), c_l)\right) \]
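
The three terms of the unbiased risk estimator above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `NONE` marker, the function names `cross_entropy` and `concealed_risk`, the choice of softmax cross-entropy as the loss, the absolute value as the correction function $g$, and the use of all instances to estimate the $\mathbb{E}_M$ term are all hypothetical choices made for the example.

```python
import numpy as np

NONE = -1  # hypothetical marker for the "none" concealed label


def cross_entropy(logits, labels):
    """Per-sample softmax cross-entropy; logits is (n, K), labels is (n,)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]


def concealed_risk(logits, s, sensitive, K, L, g=np.abs):
    """Corrected empirical risk assembled from the three terms of R_CL.

    logits    : (n, K) classifier outputs f(x_j)
    s         : (n,) concealed labels, NONE where "none" was annotated
    sensitive : index of the sensitive class c_l
    K, L      : constants in the weights K/L and (K-L)/L
    g         : correction applied to the bracketed difference (assumed: abs)
    """
    labelled = s != NONE
    # Loss of every instance against the sensitive class c_l.
    loss_sens = cross_entropy(logits, np.full(len(s), sensitive))

    # Term 1: instances that carry an insensitive concealed label.
    term1 = (K / L) * cross_entropy(logits[labelled], s[labelled]).mean()
    # Term 2: instances annotated "none", scored against c_l.
    term2 = (K / L) * loss_sens[~labelled].mean()
    # Term 3: marginal correction, averaged over all instances (assumption).
    term3 = ((K - L) / L) * loss_sens.mean()
    return term1 + g(term2 - term3)


# Toy usage: 6 instances, K = 4 classes, class 3 treated as sensitive.
logits = np.zeros((6, 4))
s = np.array([0, 1, NONE, 2, NONE, NONE])
risk = concealed_risk(logits, s, sensitive=3, K=4, L=2)
```

With uniform logits every per-sample loss equals $\log K$, so in this toy call the three terms are $2\log 4$, $2\log 4$ and $\log 4$, and the returned value is $3\log 4$. The sketch assumes both the labelled and the "none" subsets are non-empty; an empty subset would make the corresponding mean undefined.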

Learning from Concealed Labels

Multi-label Learning from Privacy-Label

Learning Discrimination from Contaminated Data: Multi-Instance Learning for Unsupervised Anomaly Detection

Learning from Complementary Labels

Learning with Privileged Information for Multi-Label Classification

Privacy-Preserving Cost-Sensitive Learning

Handling New Class in Online Label Shift

Learning from Multi-Dimensional Partial Labels

Classification Learning From Private Data In Heterogeneous Settings

Learning with Biased Complementary Labels

Learning to Conceal: A Deep Learning Based Method for Preserving Privacy and Avoiding Prejudice

Does Label Differential Privacy Prevent Label Inference Attacks?

Feature Selection for Classification under Anonymity Constraint

A Comprehensive Study of Privacy Risks in Curriculum Learning

Complementary Labels Learning with Augmented Classes

Learning Image Labels On-the-fly for Training Robust Classification Models

When graph convolution meets double attention: online privacy disclosure detection with multi-label text classification

Multiclass Learning with Partially Corrupted Labels

A Universal Unbiased Method for Classification from Aggregate Observations

Privacy Preserving Naive Bayes Classification

Learning Privately from Multiparty Data