Abstract: Latent Dirichlet allocation (LDA) is a topic model widely used for discovering hidden semantics in massive text corpora. Collapsed Gibbs sampling (CGS), a widely used algorithm for learning the parameters of LDA, carries a risk of privacy leakage. Specifically, the word count statistics and latent topic updates in CGS, which are essential for parameter estimation, can be exploited by adversaries to conduct effective membership inference attacks (MIAs). To date, two kinds of methods have been used in CGS to defend against MIAs: adding noise to word count statistics and utilizing inherent privacy. Each has its limitations. Noise sampled from the Laplace distribution sometimes produces negative word count statistics, which severely degrade parameter estimation in CGS. Utilizing inherent privacy provides only weak privacy guarantees against MIAs. An effective framework is therefore needed that yields accurate parameter estimates with guaranteed differential privacy. The key to obtaining accurate parameter estimates when introducing differential privacy into CGS is to make good use of the privacy budget so that a precise noise scale can be derived. We introduce Rényi differential privacy (RDP) into CGS for the first time and propose RDP-LDA, an effective framework for analyzing the privacy loss of any differentially private CGS. RDP-LDA can be used to derive a tighter upper bound on the privacy loss than the overestimated results that existing differentially private CGS methods obtain under ε-DP. Within RDP-LDA, we propose a novel truncated-Gaussian mechanism that keeps word count statistics non-negative. We also propose distribution perturbation, which provides a more rigorous privacy guarantee than utilizing inherent privacy. Experiments validate that the proposed methods produce more accurate parameter estimates under the JS-divergence metric and reduce the precision and recall of MIAs.
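The contrast between Laplace noise and a truncated Gaussian on word counts can be illustrated with a minimal sketch. This is not the paper's mechanism: the function names, the noise scales, the toy counts, and the use of rejection sampling for truncation are all assumptions made for illustration, and the sketch does not calibrate the noise to any privacy budget.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def truncated_gaussian_count(count, sigma, rng, max_tries=10_000):
    # Hypothetical helper: redraw Gaussian noise until the noisy count
    # is non-negative (rejection sampling). This only illustrates the
    # non-negativity property; it says nothing about the privacy loss.
    for _ in range(max_tries):
        noisy = count + rng.gauss(0.0, sigma)
        if noisy >= 0.0:
            return noisy
    raise RuntimeError("no non-negative sample found")


rng = random.Random(0)
counts = [0, 1, 2, 5, 40]  # toy word counts (assumed)

# Laplace noise is symmetric around zero, so small counts can go negative.
laplace_counts = [c + laplace_noise(2.0, rng) for c in counts]

# Truncated Gaussian noise never yields a negative count.
truncated_counts = [truncated_gaussian_count(c, 2.0, rng) for c in counts]

print("laplace:  ", [round(x, 2) for x in laplace_counts])
print("truncated:", [round(x, 2) for x in truncated_counts])
```

Truncation changes the noise distribution, so the privacy analysis for a plain Gaussian mechanism does not carry over unchanged; the abstract attributes that analysis to the RDP-LDA framework.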

On Privacy Protection of Latent Dirichlet Allocation Model Training

Latent Dirichlet Allocation Model Training With Differential Privacy

Improving Privacy Guarantee and Efficiency of Latent Dirichlet Allocation Model Training under Differential Privacy

Privacy-Preserving Collaborative Deep Learning with Unreliable Participants

Improving Parameter Estimation and Defensive Ability of Latent Dirichlet Allocation Model Training under Rényi Differential Privacy

Private Knowledge Transfer via Model Distillation with Generative Adversarial Networks

FDP-LDA: Inherent Privacy Amplification of Collapsed Gibbs Sampling Via Group Subsampling

A New Noise Generating Method Based on Gaussian Sampling for Privacy Preservation

Not Just Cloud Privacy: Protecting Client Privacy in Teacher-Student Learning

Mind the Privacy Unit! User-Level Differential Privacy for Language Model Fine-Tuning

Differentially Private Low-Rank Adaptation of Large Language Model Using Federated Learning

An end-to-end Differentially Private Latent Dirichlet Allocation Using a Spectral Algorithm

PPCL: Privacy-preserving collaborative learning for mitigating indirect information leakage

An Improved Privacy-Preserving Stochastic Gradient Descent Algorithm

DP-LSSGD: A Stochastic Optimization Method to Lift the Utility in Privacy-Preserving ERM

Differentially private regression analysis with dynamic privacy allocation

Local Differential Privacy for data collection and analysis

A General Framework for Auditing Differentially Private Machine Learning

Membership Inference Attacks and Privacy in Topic Modeling

Protection Against Reconstruction and Its Applications in Private Federated Learning

LLM-based Privacy Data Augmentation Guided by Knowledge Distillation with a Distribution Tutor for Medical Text Classification