Privacy Preserving Data Imputation via Multi-party Computation for Medical Applications

Julia Jentsch, Ali Burak Ünal, Şeyma Selcan Mağara, Mete Akgün

2024-05-29

Abstract: Handling missing data is crucial in machine learning, but many datasets contain gaps due to errors or non-response. Traditional methods such as listwise deletion are simple but inadequate; the literature offers more sophisticated and effective methods that preserve sample size and improve accuracy. However, these methods require access to the whole dataset, which conflicts with privacy regulations when the data is distributed among multiple sources. Especially in the medical and healthcare domain, such access reveals sensitive information about patients. This study addresses privacy-preserving imputation methods for sensitive data using secure multi-party computation, which enables computation without revealing any party's sensitive information. We realize the mean, median, regression, and kNN imputation methods in a privacy-preserving way. We specifically target the medical and healthcare domains, given the importance of protecting patient data, and showcase our methods on a diabetes dataset. Experiments on this dataset validated the correctness of our privacy-preserving imputation methods, yielding a largest error of around $3 \times 10^{-3}$ and closely matching the plaintext methods. We also analyzed how our methods scale with the number of samples, showing their applicability to real-world healthcare problems. Our analysis demonstrated that all our methods scale linearly with the number of samples. Except for kNN, the runtime of all our methods indicates that they can be utilized for large datasets.

Cryptography and Security, Machine Learning

What problem does this paper attempt to address?

This paper explores how to perform privacy-preserving data imputation (the filling in of missing values) in healthcare applications through secure Multi-Party Computation (MPC). Traditional approaches such as listwise deletion are simple but inadequate, while more sophisticated methods require access to the complete dataset, which can violate privacy regulations when the data is distributed among multiple sources, especially when sensitive medical data is involved. The researchers therefore use MPC to carry out the computation without revealing any party's sensitive information.

The paper implements privacy-preserving versions of four imputation methods: mean imputation, median imputation, regression imputation, and k-Nearest Neighbors (kNN) imputation. The methods target the healthcare field, given the importance of protecting patient data, and are validated through experiments on a diabetes dataset. The results closely match the plaintext methods, with a largest error of approximately $3 \times 10^{-3}$, which supports the correctness of the privacy-preserving versions.

The paper also analyzes scalability over varying numbers of samples. All four methods scale linearly with the number of samples. Except for kNN imputation, the runtimes indicate that the methods can be used for large datasets. In conclusion, the paper addresses missing values in healthcare data while protecting individual privacy, and it offers a secure and practical approach to data preprocessing in the medical and healthcare domain.
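To make the idea of privacy-preserving mean imputation concrete, the following is a minimal single-process sketch, not the authors' protocol. It assumes additive secret sharing over a 64-bit ring with fixed-point encoding and semi-honest parties, none of which is specified in the text above. All names (`MODULUS`, `SCALE`, `encode`, `decode`, `share`, `secure_mean`) are hypothetical. As a simplification, the sketch opens the total sum and count at the end; a full MPC protocol would typically keep the result secret-shared and divide securely.

```python
import secrets

MODULUS = 2 ** 64   # ring size for additive shares (assumption)
SCALE = 2 ** 16     # fixed-point scaling factor (assumption)


def encode(x: float) -> int:
    """Map a real value to a fixed-point ring element."""
    return round(x * SCALE) % MODULUS


def decode(v: int) -> float:
    """Map a ring element back to a real value (upper half encodes negatives)."""
    if v >= MODULUS // 2:
        v -= MODULUS
    return v / SCALE


def share(value: int, n_parties: int) -> list[int]:
    """Split a ring element into n additive shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_mean(local_columns: list[list[float]]) -> float:
    """Compute the global mean of one feature held by several data owners.

    Each inner list holds one owner's observed (non-missing) values.
    Only the aggregated sum and count are reconstructed, never a local one.
    """
    n = len(local_columns)
    # Every owner secret-shares its local sum and count; row i comes from owner i.
    sum_shares = [share(encode(sum(col)), n) for col in local_columns]
    cnt_shares = [share(len(col), n) for col in local_columns]
    # Party j locally adds the j-th share received from every owner.
    agg_sum = [sum(row[j] for row in sum_shares) % MODULUS for j in range(n)]
    agg_cnt = [sum(row[j] for row in cnt_shares) % MODULUS for j in range(n)]
    # Reconstruct the aggregates only (simplification: opened in the clear).
    total = decode(sum(agg_sum) % MODULUS)
    count = sum(agg_cnt) % MODULUS
    return total / count


if __name__ == "__main__":
    hospital_a = [5.1, 6.3, 7.0]   # made-up illustrative values
    hospital_b = [4.8, 5.9]
    print(secure_mean([hospital_a, hospital_b]))
```

Each owner would then fill its own missing entries with the returned mean. The fixed-point encoding introduces a small rounding error relative to the plaintext mean, which is the kind of deviation the reported error figure measures, although the error sources in the actual implementation may differ.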


Multilevel Stochastic Optimization for Imputation in Massive Medical Data Records

What is Hiding in Medicine's Dark Matter? Learning with Missing Data in Medical Practices

Impact of machine learning-based imputation techniques on medical datasets- a comparative analysis

On the Performance of Imputation Techniques for Missing Values on Healthcare Datasets

Improved clinical data imputation via classical and quantum determinantal point processes

Personalized Imputation in metric spaces via conformal prediction: Applications in Predicting Diabetes Development with Continuous Glucose Monitoring Information

Web-Based Privacy-Preserving Multicenter Medical Data Analysis Tools Via Threshold Homomorphic Encryption: Design and Development Study

Exploring Privacy-Preserving Disease Diagnosis: A Comparative Analysis

A method for comparing multiple imputation techniques: A case study on the U.S. national COVID cohort collaborative

Privacy-preserving medical diagnosis system with Gaussian kernel-based support vector machine

Privacy-Preserving Methods for Vertically Partitioned Incomplete Data

Can large language models be privacy preserving and fair medical coders?

Benchmarking Machine Learning Missing Data Imputation Methods in Large-Scale Mental Health Survey Databases

Tuberculosis of the small intestine.

SICE: an improved missing data imputation technique

Missing Values and Imputation in Healthcare Data: Can Interpretable Machine Learning Help?

Chasing Your Long Tails: Differentially Private Prediction in Health Care Settings

[Contribution to the knowledge of the efficiency of the respiratory function in a group of school-age children with nanism].

A Machine Learning-Based Multiple Imputation Method for the Health and Aging Brain Study–Health Disparities