Abstract:Author(s): Song, Shuang | Advisor(s): Chaudhuri, Kamalika | Abstract: Modern machine learning increasingly involves personal data, such as healthcare, financial and user behavioral information. However, models trained on such data can reveal detailed information about the data and cause a serious privacy breach. Consequently, it is important to design algorithms that can analyze the sensitive data while still preserving privacy. This thesis advances the state-of-the-art of privacy-preserving machine learning in the following two major aspects.First, this thesis addresses the challenges in differentially private large-scale machine learning. On the one hand, with a large amount of sensitive user data, privacy-preserving learning algorithms are expected to achieve improved utility. On the other hand, big data imposes additional challenges, including performance (a.k.a Volume), data noise (a.k.a Veracity), a large number of classes and distributed sources (a.k.a Variety). This thesis presents (1) private versions of the widely used stochastic gradient descent (SGD) algorithm with generalization to data from multiple sources of different privacy requirements, and (2) an improved version of the Private Aggregation of Teacher Ensembles (PATE) framework which can scale to learning tasks with a large number of output classes and uncurated, imbalanced training data.Second, this thesis considers privacy-preserving data analysis beyond tabular data. Differential privacy is best suited for tabular data, where each record corresponds to all the information about an individual and records are independent of each other. However, many real-world applications involve non-tabular sensitive data, such as epidemic transmission graphs and measurement of the physical activity of a single subject across time. To analyze disease transmission graphs, or more general, graphs with sensitive information stored in each node, this thesis considers privacy-preserving continual release of graph statistics, such as the percentages of highly-active patients over time. The proposed algorithm outperforms the baselines in utility over a range of parameters. For physical activity measurement, or more generally, data with correlation, this thesis looks at a recent generalization of differential privacy, called Pufferfish privacy, that addresses privacy concerns in correlated data. Two mechanisms that work under different scenarios are proposed, and one of them is evaluated on real and synthetic time-series data.

Privacy Preserving Analytics on Distributed Medical Data

Privacy-Preserving Machine Learning Algorithms for Big Data Systems

Systematic Review of Privacy-Preserving Distributed Machine Learning From Federated Databases in Health Care

Secure and Differentially Private Bayesian Learning on Distributed Data

Report: State of the Art Solutions for Privacy Preserving Machine Learning in the Medical Context

Privacy-Preserving Algorithms for Machine Learning

Differential Privacy-enabled Federated Learning for Sensitive Health Data

Chasing Your Long Tails: Differentially Private Prediction in Health Care Settings

Privacy Protection of Healthcare Data over Social Networks Using Machine Learning Algorithms

Generation of Differentially Private Heterogeneous Electronic Health Records

Privacy-Preserving Generalized Linear Models using Distributed Block Coordinate Descent

ModelChain: Decentralized Privacy-Preserving Healthcare Predictive Modeling Framework on Private Blockchain Networks

Guaranteed Distributed Machine Learning: Privacy-preserving Empirical Risk Minimization

Distributed non-disclosive validation of predictive models by a modified ROC-GLM

Generative Data Augmentation with Differential Privacy for Non-Iid Problem in Decentralized Clinical Machine Learning

Gradient Mechanism to Preserve Differential Privacy and Deter Against Model Inversion Attacks in Healthcare Analytics

Privacy Preserving Machine Learning for Electronic Health Records using Federated Learning and Differential Privacy

Global Differential Privacy for Distributed Metaverse Healthcare Systems

Spatio-Temporal Split Learning for Privacy-Preserving Medical Platforms: Case Studies with COVID-19 CT, X-Ray, and Cholesterol Data

Private Empirical Risk Minimization with Analytic Gaussian Mechanism for Healthcare System

Preserving Differential Privacy Between Features in Distributed Estimation