Reproducibility-Oriented and Privacy-Preserving Genomic Dataset Sharing

Yuzhou Jiang, Tianxi Ji, Pan Li, Erman Ayday
2024-08-28
Abstract: Although genomic research has become increasingly widespread in recent years, few studies have shared their datasets because of privacy concerns about the genomic records. This hinders the reproduction and validation of research outcomes, which are crucial for catching errors, e.g., miscalculations, made during the research process. To address the reproducibility issue of genome-wide association study (GWAS) outcomes, we propose a differential privacy-based scheme for sharing genomic datasets. The proposed scheme has two stages. In the first stage, we generate a noisy copy of the target dataset by applying an optimized version of a previously proposed XOR mechanism to the binarized (encoded) dataset, where the binary noise generation takes biological features into account. However, this first stage introduces significant noise, making the dataset less suitable for direct GWAS outcome validation. Thus, in the second stage, we apply a post-processing technique that uses optimal transport to adjust the minor allele frequencies (MAFs) in the noisy dataset so that they align more closely with public MAF information, and then decode the result back to genomic space. We evaluate the proposed scheme on three real-life genomic datasets and compare it with a baseline approach (local differential privacy) and two synthesis-based solutions with regard to GWAS outcome validation, data utility, and resistance against membership inference attacks (MIAs). We show that the proposed scheme outperforms all other methods in detecting GWAS outcome errors, achieves better utility, and provides stronger privacy protection against MIAs. With our method, genomic researchers will be more inclined to share a differentially private, yet high-quality, version of their datasets.
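To make the first stage more concrete, the following is a minimal Python sketch of the encode, XOR-perturb, decode pipeline, together with a MAF computation. It is an illustration under stated assumptions, not the authors' implementation: the 2-bit genotype encoding, the function names, and the flip probability are all hypothetical, and independent bit flips (plain randomized response) stand in for the optimized XOR mechanism, whose noise the abstract says is generated with biological features in mind. The optimal-transport MAF adjustment of the second stage is not shown.

```python
import random

# Hypothetical 2-bit encoding of a SNP genotype, where the genotype value
# is the number of minor alleles carried by an individual (0, 1, or 2).
ENCODE = {0: (0, 0), 1: (0, 1), 2: (1, 1)}
# Both (0, 1) and (1, 0) decode to a heterozygous genotype, so every
# perturbed bit pair maps back to a valid genotype.
DECODE = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 2}


def binarize(genotypes):
    """Flatten a list of genotypes (0/1/2) into a list of bits."""
    return [bit for g in genotypes for bit in ENCODE[g]]


def xor_perturb(bits, flip_prob, rng):
    """XOR each bit with independent Bernoulli(flip_prob) noise.

    Assumption: independent flips are a simplified stand-in for the
    XOR mechanism named in the abstract.
    """
    return [b ^ (1 if rng.random() < flip_prob else 0) for b in bits]


def decode(bits):
    """Map consecutive bit pairs back to genotypes (0/1/2)."""
    return [DECODE[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]


def maf(genotypes):
    """Frequency of the counted allele across diploid individuals.

    This equals the MAF when genotypes count minor alleles.
    """
    return sum(genotypes) / (2 * len(genotypes))


rng = random.Random(0)                       # fixed seed for repeatability
snp_column = [0, 1, 2, 0, 1, 0, 0, 2]        # one SNP, eight individuals
noisy_column = decode(xor_perturb(binarize(snp_column), 0.1, rng))
print(maf(snp_column), maf(noisy_column))
```

Comparing the two printed frequencies shows how bit-level noise can shift a SNP's allele frequency, which is the distortion the second stage is meant to reduce by pulling the noisy MAFs back toward public MAF information.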
Cryptography and Security