Calibrating Practical Privacy Risks for Differentially Private Machine Learning

Yuechun Gu, Keke Chen
2024-10-30
Abstract: Differential privacy quantifies privacy through the privacy budget $\epsilon$, yet its practical interpretation is complicated by variations across models and datasets. Recent research on differentially private machine learning and membership inference has highlighted that, with the same theoretical $\epsilon$ setting, the attacking success rate (ASR) of the likelihood-ratio-based membership inference attack (LiRA) may vary according to specific datasets and models, which might make it a better indicator for evaluating real-world privacy risks. Inspired by this practical privacy measure, we study approaches that can lower the attacking success rate to allow for more flexible privacy budget settings in model training. We find that by selectively suppressing privacy-sensitive features, we can achieve lower ASR values without compromising application-specific data utility. We use the SHAP and LIME model explainers to evaluate feature sensitivities and develop feature-masking strategies. Our findings demonstrate that the LiRA $ASR^M$ on model $M$ can properly indicate the inherent privacy risk of a dataset for modeling, and that it is possible to modify datasets to enable the use of larger theoretical $\epsilon$ settings to achieve equivalent practical privacy protection. We have conducted extensive experiments to show the inherent link between ASR and the dataset's privacy risk. By carefully selecting features to mask, we can preserve more data utility with equivalent practical privacy protection and relaxed $\epsilon$ settings. The implementation details are shared online at https://anonymous.4open.science/r/On-sensitive-features-and-empirical-epsilon-lower-bounds-BF67/.
Machine Learning, Cryptography and Security
What problem does this paper attempt to address?
The paper addresses two main problems:

1. **How to select an appropriate privacy budget \(\epsilon\) in differentially private machine learning**: Existing \(\epsilon\) settings are often overly conservative, which causes a significant loss of data utility. The authors propose using a measure of practical privacy risk to guide the choice of \(\epsilon\), in order to strike a better balance between data utility and privacy protection.
2. **How to modify the dataset to allow for more lenient \(\epsilon\) settings**: The authors study reducing the sensitivity of the dataset by selectively suppressing sensitive features, so that a larger \(\epsilon\) can be used while maintaining data utility and achieving equivalent practical privacy protection.

### Paper Background and Motivation

- **Differential privacy**: Differential privacy quantifies the degree of privacy protection through a privacy budget \(\epsilon\). The theoretical \(\epsilon\) setting does not depend on the data or the model, so in practice \(\epsilon\) tends to be chosen too conservatively, which reduces the utility of the data.
- **Membership Inference Attack (MIA)**: An MIA uses the trained model to infer whether a sample belongs to the training dataset. The privacy risk of a model can be evaluated with the Likelihood Ratio Attack (LiRA).
- **Feature sensitivity**: Identifying and suppressing sensitive features reduces the sensitivity of the dataset, which allows larger \(\epsilon\) values and improves data utility.

### Main Contributions

1. **A first study of the data- and model-specific properties of LiRA attacks**: The authors find that the attack success rate (ASR) of LiRA depends closely on the dataset and the model, and that it can serve as a practical indicator for selecting \(\epsilon\).
2. **Reducing dataset sensitivity through feature masking**: The authors propose lowering the ASR of a dataset by selectively suppressing sensitive features, so that a larger \(\epsilon\) can be used without sacrificing data utility.
3. **Experimental verification of the method**: The authors run experiments on multiple datasets and show that the method can significantly improve data utility while maintaining the same level of practical privacy protection.

### Method Overview

- **ASR as a privacy risk indicator**: The authors use the ASR of LiRA as an indicator of the practical privacy risk of a dataset. A lower ASR indicates a lower privacy risk.
- **Feature sensitivity analysis**: Model explanation techniques (such as SHAP and LIME) are used to identify the features to which the privacy task and the utility task are most sensitive.
- **Optimized feature masking**: Feature selection is posed as an optimization problem that picks the masking strategy maximizing data utility while controlling privacy loss.

### Experimental Results

- **Variation in ASR across datasets**: ASR values differ significantly across datasets and models.
- **Impact of feature masking on data utility**: Feature masking can significantly reduce ASR while maintaining high data utility.
- **Effects of different feature masking strategies**: Different masking strategies affect data utility differently, and the optimized strategy balances privacy and utility better.

In conclusion, the paper proposes a practical method that reduces the sensitivity of a dataset by selectively suppressing sensitive features, which allows more flexible privacy budget settings while maintaining data utility.
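
To make the steps in the method overview more concrete, the sketch below shows one possible way to score an attack and choose features to mask. It is not the authors' implementation. The function names, the use of balanced accuracy as the ASR, the greedy privacy-to-utility ratio rule, and the constant fill value are all assumptions made for illustration. The per-feature scores are assumed to come from an explainer such as SHAP or LIME and are passed in as plain arrays.

```python
import numpy as np


def attack_success_rate(scores, is_member, threshold):
    """ASR of a threshold attack, taken here as balanced accuracy (an assumption).

    scores: attack confidence per sample (higher means "more likely a member").
    is_member: ground-truth membership labels (truthy for training members).
    """
    scores = np.asarray(scores, dtype=float)
    is_member = np.asarray(is_member, dtype=bool)
    guess = scores >= threshold
    tpr = np.mean(guess[is_member])      # members correctly flagged
    tnr = np.mean(~guess[~is_member])    # non-members correctly rejected
    return 0.5 * (tpr + tnr)


def select_features_to_mask(privacy_score, utility_score, utility_budget):
    """Greedy heuristic (hypothetical, not the paper's optimizer).

    Visits features in decreasing order of privacy sensitivity per unit of
    utility importance and masks each one that still fits within the total
    utility importance we are willing to give up.
    """
    privacy_score = np.asarray(privacy_score, dtype=float)
    utility_score = np.asarray(utility_score, dtype=float)
    ratio = privacy_score / (utility_score + 1e-12)  # avoid division by zero
    masked, spent = [], 0.0
    for j in np.argsort(-ratio):
        if spent + utility_score[j] <= utility_budget:
            masked.append(int(j))
            spent += utility_score[j]
    return masked


def mask_features(X, masked, fill_value=0.0):
    """Return a copy of X with the selected columns overwritten by a constant."""
    X_masked = np.array(X, dtype=float, copy=True)
    X_masked[:, masked] = fill_value
    return X_masked


if __name__ == "__main__":
    # Made-up sensitivity scores for four features, for illustration only.
    privacy = [0.9, 0.1, 0.5, 0.3]
    utility = [0.1, 0.8, 0.2, 0.6]
    masked = select_features_to_mask(privacy, utility, utility_budget=0.35)
    print("features to mask:", masked)
    print(mask_features(np.ones((2, 4)), masked))
```

In a full pipeline, one would retrain the model on the masked data, rerun the membership inference attack, and compare the resulting ASR against the unmasked baseline to decide whether a larger \(\epsilon\) gives equivalent practical protection.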