Privacy Preserving Data Imputation via Multi-party Computation for Medical Applications

Julia Jentsch, Ali Burak Ünal, Şeyma Selcan Mağara, Mete Akgün
2024-05-29
Abstract: Handling missing data is crucial in machine learning, but many datasets contain gaps due to errors or non-response. Unlike traditional methods such as listwise deletion, which are simple but inadequate, the literature offers more sophisticated and effective methods, thereby improving sample size and accuracy. However, these methods require accessing the whole dataset, which contradicts the privacy regulations when the data is distributed among multiple sources. Especially in the medical and healthcare domain, such access reveals sensitive information about patients. This study addresses privacy-preserving imputation methods for sensitive data using secure multi-party computation, enabling secure computations without revealing any party's sensitive information. In this study, we realized the mean, median, regression, and kNN imputation methods in a privacy-preserving way. We specifically target the medical and healthcare domains considering the significance of protection of the patient data, showcasing our methods on a diabetes dataset. Experiments on the diabetes dataset validated the correctness of our privacy-preserving imputation methods, yielding the largest error around $3 \times 10^{-3}$, closely matching plaintext methods. We also analyzed the scalability of our methods to varying numbers of samples, showing their applicability to real-world healthcare problems. Our analysis demonstrated that all our methods scale linearly with the number of samples. Except for kNN, the runtime of all our methods indicates that they can be utilized for large datasets.
Cryptography and Security, Machine Learning
What problem does this paper attempt to address?
This paper addresses privacy-preserving data imputation (the filling in of missing values) for healthcare applications using secure multi-party computation (MPC). Simple approaches such as listwise deletion are often inadequate and can introduce bias, while more sophisticated imputation methods require access to the complete dataset. When the data is distributed among multiple sources, that access can violate privacy regulations, especially for sensitive medical records. The authors therefore use MPC, which lets the parties compute jointly without revealing any party's sensitive information. The paper implements privacy-preserving versions of four imputation methods: mean, median, regression, and k-nearest neighbors (kNN) imputation. The methods target the medical and healthcare domains, where protecting patient data is essential, and are evaluated on a diabetes dataset. In these experiments the privacy-preserving methods closely match their plaintext counterparts, with a largest error of around $3 \times 10^{-3}$, which validates their correctness. The paper also analyzes scalability with respect to the number of samples: all four methods scale linearly. The runtimes of all methods except kNN indicate that they can be used on large datasets; kNN imputation is slower because it must compute distances between samples. In short, the paper aims to handle missing values in healthcare data without exposing individual records, offering a secure approach to data preprocessing in the medical and health field.
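To make the idea of computing an imputation value without pooling raw data more concrete, here is a minimal Python sketch of mean imputation over additively secret-shared local sums and counts. It is an illustration only and not the authors' protocol or framework: the names `PRIME`, `SCALE`, `share`, `reconstruct`, `secure_mean`, and `impute` are hypothetical, all parties are simulated in one process, and the sketch opens the aggregate sum and count, which a full MPC protocol might keep secret-shared.

```python
import secrets

PRIME = 2**61 - 1   # field modulus (assumed for this sketch)
SCALE = 10**4       # fixed-point scaling factor (assumed for this sketch)

def share(value, n_parties):
    """Split an integer into n additive shares modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    """Recombine additive shares and decode a signed integer."""
    total = sum(shares) % PRIME
    return total - PRIME if total > PRIME // 2 else total

def secure_mean(local_columns):
    """Mean over all parties' observed values (None marks a missing entry).

    Each party shares only its local sum and count; no individual
    record leaves its owner in the clear.
    """
    n = len(local_columns)
    sum_shares, count_shares = [], []
    for column in local_columns:
        observed = [v for v in column if v is not None]
        sum_shares.append(share(round(sum(observed) * SCALE), n))
        count_shares.append(share(len(observed), n))
    # Party j locally adds the j-th share it received from every party.
    agg_sum = [sum(s[j] for s in sum_shares) % PRIME for j in range(n)]
    agg_count = [sum(c[j] for c in count_shares) % PRIME for j in range(n)]
    # Only the aggregated sum and count are opened.
    return reconstruct(agg_sum) / SCALE / reconstruct(agg_count)

def impute(column, mean):
    """Replace missing entries in one party's column with the joint mean."""
    return [mean if v is None else v for v in column]

if __name__ == "__main__":
    hospitals = [[1.0, None, 3.0], [5.0, 7.0, None]]
    joint_mean = secure_mean(hospitals)
    print([impute(column, joint_mean) for column in hospitals])
```

The fixed-point scaling introduces a small rounding error relative to a plaintext floating-point mean, which is the kind of discrepancy the reported error bound refers to in spirit; the actual source and size of the error in the paper depend on the authors' own number representation.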