Privacy Preserving Analytics on Distributed Medical Data

Marina Blanton, Ah Reum Kang, Subhadeep Karan, Jaroslaw Zola
DOI: https://doi.org/10.48550/arXiv.1806.06477
2018-06-18
Abstract: Objective: To enable privacy-preserving learning of high-quality generative and discriminative machine learning models from distributed electronic health records. Methods and Results: We describe a general and scalable strategy to build machine learning models in a provably privacy-preserving way. Compared to standard approaches using, e.g., differential privacy, our method does not require alteration of the input biomedical data, works with completely or partially distributed datasets, and is resilient as long as the majority of the sites participating in data processing are trusted not to collude. We show how the proposed strategy can be applied to distributed medical records to solve the variables assignment problem, the key task in exact feature selection and Bayesian network learning. Conclusions: Our proposed architecture can be used by health care organizations, spanning providers, insurers, researchers and computational service providers, to build robust and high-quality predictive models in cases where distributed data has to be combined without being disclosed, altered or otherwise compromised.
Cryptography and Security
What problem does this paper attempt to address?
This paper addresses privacy-preserving analytics on distributed medical data. Its goal is to build high-quality generative and discriminative machine learning models from distributed electronic health records without disclosing or altering the original data. To that end, it describes a general and scalable strategy for building such models in a provably privacy-preserving way. Compared with standard privacy-protection approaches such as differential privacy, the proposed method does not require any alteration of the input biomedical data, works with completely or partially distributed datasets, and remains resilient as long as the majority of the sites participating in data processing are trusted not to collude. The paper also shows how the strategy can be applied to distributed medical records to solve the variables assignment problem, the key task in exact feature selection and Bayesian network learning. The resulting architecture lets health care providers, insurers, researchers, and computational service providers combine distributed data to build robust, high-quality predictive models without the data being disclosed, altered, or otherwise compromised.
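The summary does not say which cryptographic building block provides the "majority not colluding" guarantee. As an illustration only, the sketch below assumes Shamir secret sharing, a common tool in honest-majority secure computation, and shows how several data holders could combine local record counts into a total without any compute site seeing an individual count. The field modulus, the function names (`share`, `reconstruct`, `secure_sum`), and the example counts are all hypothetical and are not taken from the paper.

```python
import secrets

# Field modulus for the sketch (a Mersenne prime); an assumption, not from the paper.
PRIME = 2**61 - 1

def share(secret, n, t):
    """Split `secret` into n Shamir shares using a random degree-t polynomial.
    Any t+1 shares reconstruct the secret; t or fewer reveal nothing about it."""
    coeffs = [secret % PRIME] + [secrets.randbelow(PRIME) for _ in range(t)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation of the polynomial at x
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares

def reconstruct(shares):
    """Recover the secret by Lagrange interpolation at x = 0."""
    total = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * (-xj)) % PRIME
                den = (den * (xi - xj)) % PRIME
        total = (total + yi * num * pow(den, -1, PRIME)) % PRIME
    return total

def secure_sum(local_counts, n, t):
    """Each data holder shares its count among n compute sites; each site adds
    the shares it receives locally, and only the sum is ever reconstructed."""
    per_site = [0] * n
    for count in local_counts:
        for idx, (_, y) in enumerate(share(count, n, t)):
            per_site[idx] = (per_site[idx] + y) % PRIME
    summed = [(i + 1, per_site[i]) for i in range(n)]
    return reconstruct(summed[: t + 1])  # any t+1 sites suffice

# Hypothetical example: three hospitals, five compute sites, up to two colluding.
total = secure_sum([12, 7, 30], n=5, t=2)
```

With n = 5 and t = 2, any two colluding sites learn nothing about an individual hospital's count, which mirrors the honest-majority condition stated above. Count aggregation is only the simplest such operation; the paper's actual protocols for the variables assignment problem are not reproduced here.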