On Anonymizing Medical Microdata with Large-Scale Missing Values - A Case Study with the FAERS Dataset

Mei-Hui Hsiao, Wen-Yang Lin, Kuang-Yung Hsu, Zih-Xun Shen
DOI: https://doi.org/10.1109/EMBC.2019.8857025
Abstract: As big data analysis becomes one of the main driving forces for productivity and economic growth, concern about the disclosure of individual privacy increases as well, especially for applications that access medical or health data containing personal information. Most contemporary techniques for privacy-preserving data publishing rest on a simple assumption: the data of concern is complete, i.e., it contains no missing values. This is rarely the case in the real world. This paper presents our work on inspecting the effect of missing values on medical data privacy. In particular, we inspected the US FAERS dataset, a public dataset of adverse drug events released by the US FDA. Following the presumption of the current anonymization paradigm that the data should contain no missing values, we investigated three intuitive strategies for anonymizing the FAERS dataset: including missing values, excluding missing values, and imputing missing values. Our results demonstrate that these intuitive strategies handle data with a massive amount of missing values poorly. Accordingly, we propose a new strategy, consolidation, together with a corresponding privacy protection model and anonymization algorithm. Experimental results show that our method can prevent privacy disclosure while sustaining the data utility for adverse drug reaction (ADR) signal detection.