On the Trade-Off between Fidelity, Utility and Privacy of Synthetic Patient Data

Tim Adams, Colin Birkenbihl, Karen Otte, Hwei Geok Ng, Jonas Adrian Rieling, Anatol-Fiete Naeher, Ulrich Sax, Fabian Prasser, Holger Froehlich
DOI: https://doi.org/10.1101/2024.12.06.24317239
2024-12-08
Abstract: The advancement of medical research and healthcare is increasingly dependent on the analysis of patient-level data, but privacy concerns and legal constraints often hinder data sharing. Synthetic data mimicking real patient data offers a widely discussed potential solution. According to the literature, synthetic data may, however, not fully guarantee patient privacy and can vary greatly in terms of fidelity and utility. In this study, we aim to systematically investigate the trade-off between privacy, fidelity and utility of synthetic patient data. We assess synthetic data fidelity in terms of statistical similarity to real data, and utility via the performance of machine learning models trained on synthetic and tested on real data. Regarding data privacy we focus on membership inference via shadow model attacks as well as singling out and attribute inference risks. In this regard, we also consider differential privacy (DP) as a possible mechanism to probabilistically guarantee a certain level of data privacy, and we compare against classical anonymization techniques. We evaluate the fidelity, utility and privacy of synthetic data generated by five different models for three distinctive patient-level datasets. Our results show that our implementations of DP have a strongly detrimental effect on the fidelity of synthetic data, specifically its correlation structure, and therefore emphasize the need to improve techniques that effectively balance privacy, fidelity and utility in synthetic patient data generation.
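The utility measure described in the abstract (train a model on synthetic data, test it on real data) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the nearest-centroid classifier, the two-class Gaussian toy data, and the names `tstr_accuracy` and `toy_data` are hypothetical stand-ins for the authors' models, datasets and metrics.

```python
import random

def nearest_centroid_fit(X, y):
    """Return one mean vector per class (a deliberately simple stand-in classifier)."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def nearest_centroid_predict(centroids, X):
    """Assign each row to the class whose centroid is closest (squared Euclidean)."""
    def sqdist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [min(centroids, key=lambda c: sqdist(centroids[c], row)) for row in X]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def tstr_accuracy(X_syn, y_syn, X_real, y_real):
    """Train on synthetic, test on real: fit on (X_syn, y_syn), score on (X_real, y_real)."""
    model = nearest_centroid_fit(X_syn, y_syn)
    return accuracy(y_real, nearest_centroid_predict(model, X_real))

def toy_data(n, shift, rng):
    """Hypothetical toy data: class 1 is shifted by `shift` in both features."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(label * shift, 1.0), rng.gauss(label * shift, 1.0)])
        y.append(label)
    return X, y

rng = random.Random(0)                    # fixed seed for reproducibility
X_real, y_real = toy_data(200, 3.0, rng)  # plays the role of the real cohort
X_syn, y_syn = toy_data(200, 3.0, rng)    # plays the role of a generator's output
score = tstr_accuracy(X_syn, y_syn, X_real, y_real)
print(f"TSTR accuracy: {score:.3f}")
```

In a setup like this, a TSTR score is usually read against the same model trained and tested on real data: a large gap suggests that the synthetic data has lost structure the model depends on.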