Near-Optimal Privacy-Utility Tradeoff in Genomic Studies Using Selective SNP Hiding

Nour Almadhoun Alserr, Gulce Kale, Onur Mutlu, Oznur Tastan, Erman Ayday
DOI: https://doi.org/10.48550/arXiv.2106.05211
2021-06-10
Abstract: Motivation: Researchers need a rich trove of genomic datasets that they can leverage to gain a better understanding of the genetic basis of the human genome and identify associations between phenotypes and specific parts of DNA. However, sharing genomic datasets that include sensitive genetic or medical information of individuals can lead to serious privacy-related consequences if data lands in the wrong hands. Restricting access to genomic datasets is one solution, but this greatly reduces their usefulness for research purposes. To allow sharing of genomic datasets while addressing these privacy concerns, several studies propose privacy-preserving mechanisms for data sharing. Differential privacy (DP) is one such mechanism that formalizes rigorous mathematical foundations to provide privacy guarantees while sharing aggregated statistical information about a dataset. However, it has been shown that the original privacy guarantees of DP-based solutions degrade when there are dependent tuples in the dataset, which is a common scenario for genomic datasets (due to the existence of family members). Results: In this work, we introduce a near-optimal mechanism to mitigate the vulnerabilities of the inference attacks on differentially private query results from genomic datasets including dependent tuples. We propose a utility-maximizing and privacy-preserving approach for sharing statistics by hiding selective SNPs of the family members as they participate in a genomic dataset. By evaluating our mechanism on a real-world genomic dataset, we empirically demonstrate that our proposed mechanism can achieve up to 40% better privacy than state-of-the-art DP-based solutions, while near-optimally minimizing the utility loss.
Cryptography and Security, Human-Computer Interaction
What problem does this paper attempt to address?
The problem that this paper attempts to solve is how to share genomic data effectively while protecting personal privacy in genomics research. Specifically, the paper focuses on reducing the risk of inference attacks by selectively hiding single-nucleotide polymorphisms (SNPs) when genomic datasets contain data of family members, while minimizing the impact on data utility. Traditional differential privacy (DP) methods offer weaker privacy guarantees on related data (such as data of family members) because they assume that all entries in the dataset are independent. The paper therefore proposes a new mechanism that provides near-optimal privacy protection and maintains high data utility by selectively hiding some SNPs.

### Specific problem description

1. **Privacy risks**: Genomic datasets contain sensitive genetic or medical information. If this data falls into the wrong hands, it may lead to serious privacy problems. In particular, when the dataset contains data of family members, the high genetic similarity between relatives means that traditional differential privacy techniques may not protect privacy effectively.
2. **Utility loss**: To protect privacy, existing differential privacy techniques usually need to add a large amount of noise, which significantly reduces data utility and affects the quality of scientific research.

### Solution

This paper proposes a method of selectively hiding SNPs, aiming to:

- **Reduce the risk of inference attacks**: By selectively hiding some SNPs, reduce the kinship estimates between family members, thereby reducing potential privacy leakage.
- **Maintain data utility**: While protecting privacy, minimize the impact on data utility so that researchers can still obtain high-quality statistical information.

### Method overview

1. **Selective hiding**: For each family member newly added to the dataset, selectively hide some specific SNP positions to reduce the kinship estimates between family members.
2. **Optimization model**: Use an integer programming model to determine the number of SNPs to be hidden, ensuring that the number of hidden SNPs is minimized while meeting privacy constraints.
3. **Overlapping area first**: Give priority to selecting SNPs to be hidden from the overlapping areas between family members. This reduces kinship estimates more effectively while maintaining high data utility.

### Mathematical model

The positions to hide are chosen according to SNP configurations, with the aim of lowering the kinship coefficient. Assume that an individual \(j\) has an SNP configuration \(s_j\) at a certain genomic position, where \(s_j\) can take values {0, 1, 2}. For a pair of individuals, the counts \(n_{s_i s_k}\) used below give the number of positions at which the first individual has configuration \(s_i\) and the second has configuration \(s_k\); an asterisk matches any configuration. For the kinship coefficient \(\phi_{ik}\) between two individuals \(i\) and \(k\), use the robust kinship estimator proposed by Manichaikul et al.:

\[ \phi_{ik}=\frac{2n_{11}-4(n_{02}+n_{20})-n_{*1}+n_{1*}}{4n_{1*}} \]

where:

- \(n_{11}\) is the number of positions where both individuals are heterozygous.
- \(n_{02}\) and \(n_{20}\) are the numbers of positions where the two individuals are homozygous for opposite alleles (configurations 0 and 2, or 2 and 0, for the first and second individual, respectively).
- \(n_{1*}\) and \(n_{*1}\) are the numbers of positions where the first individual and the second individual, respectively, are heterozygous.
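As a rough illustration of how the counts in the estimator can be obtained, the following minimal sketch tallies them from two genotype vectors and evaluates the coefficient. It assumes the estimator is written with the numerator \(2n_{11}-4(n_{02}+n_{20})-n_{*1}+n_{1*}\) and the denominator \(4n_{1*}\); the function names `pair_counts` and `kinship` and the list-of-integers encoding are hypothetical choices made for this example, not taken from the paper.

```python
from typing import Sequence, Tuple


def pair_counts(g_i: Sequence[int], g_k: Sequence[int]) -> Tuple[int, int, int, int, int]:
    """Return (n11, n02, n20, n1s, ns1) for two genotype vectors coded 0/1/2.

    n1s stands for n_{1*} (first individual heterozygous) and
    ns1 stands for n_{*1} (second individual heterozygous).
    """
    if len(g_i) != len(g_k):
        raise ValueError("genotype vectors must have the same length")
    n11 = n02 = n20 = n1s = ns1 = 0
    for a, b in zip(g_i, g_k):
        if a == 1:
            n1s += 1
        if b == 1:
            ns1 += 1
        if a == 1 and b == 1:
            n11 += 1
        elif a == 0 and b == 2:
            n02 += 1
        elif a == 2 and b == 0:
            n20 += 1
    return n11, n02, n20, n1s, ns1


def kinship(g_i: Sequence[int], g_k: Sequence[int]) -> float:
    """Evaluate the kinship estimate from the pairwise counts (assumed form, see lead-in)."""
    n11, n02, n20, n1s, ns1 = pair_counts(g_i, g_k)
    if n1s == 0:
        raise ValueError("first individual has no heterozygous positions")
    return (2 * n11 - 4 * (n02 + n20) - ns1 + n1s) / (4 * n1s)


# Toy usage: two identical genotype vectors.
same = [1, 1, 0, 2, 1]
print(kinship(same, same))
```

For two identical vectors the three heterozygous counts coincide and the estimate evaluates to 0.5, while positions with opposite homozygous configurations pull it down.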
To reduce the kinship coefficient to a preset value \(\phi'_{ik}\), the number \(x_{11}\) of positions to hide, taken from those where both individuals are heterozygous, can be calculated as:

\[ x_{11}=\frac{2n_{11}-4(n_{02}+n_{20})-n_{*1}+n_{1*}(1 - 4\phi'_{ik})}{2(1 - 2\phi'_{ik})} \]

To keep the kinship coefficient below a preset threshold \(\Phi\), the selection can be modeled as an integer program.
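The closed form for \(x_{11}\) can be checked numerically. The sketch below assumes that each hidden position is one where both individuals are heterozygous and that a hidden position is dropped from every count, so \(n_{11}\), \(n_{1*}\), and \(n_{*1}\) each decrease by \(x_{11}\); solving the kinship expression (taken with numerator \(2n_{11}-4(n_{02}+n_{20})-n_{*1}+n_{1*}\)) for \(x_{11}\) under that assumption gives the formula used here. The function names and the toy counts are hypothetical, and rounding up to an integer is a choice made for this example rather than the paper's integer program.

```python
import math


def kinship_from_counts(n11: int, n02: int, n20: int, n1s: int, ns1: int) -> float:
    """Kinship estimate from pairwise counts; n1s is n_{1*}, ns1 is n_{*1}."""
    return (2 * n11 - 4 * (n02 + n20) - ns1 + n1s) / (4 * n1s)


def snps_to_hide(n11: int, n02: int, n20: int, n1s: int, ns1: int,
                 phi_target: float) -> int:
    """Number of doubly heterozygous positions to hide to reach phi_target.

    Assumes hidden positions are removed from n11, n1s and ns1 alike.
    """
    if not phi_target < 0.5:
        raise ValueError("phi_target must be below 0.5")
    x = (2 * n11 - 4 * (n02 + n20) - ns1 + n1s * (1 - 4 * phi_target)) \
        / (2 * (1 - 2 * phi_target))
    x = max(0, math.ceil(x))  # round up so the estimate does not exceed the target
    if x > n11:
        raise ValueError("not enough doubly heterozygous positions to hide")
    return x


def kinship_after_hiding(n11: int, n02: int, n20: int, n1s: int, ns1: int,
                         x: int) -> float:
    """Re-evaluate the estimate after hiding x doubly heterozygous positions."""
    return kinship_from_counts(n11 - x, n02, n20, n1s - x, ns1 - x)


# Toy counts (hypothetical, not from the paper's dataset).
counts = dict(n11=300, n02=5, n20=5, n1s=500, ns1=600)
before = kinship_from_counts(**counts)
x11 = snps_to_hide(**counts, phi_target=0.125)
after = kinship_after_hiding(**counts, x=x11)
print(before, x11, after)
```

With these toy counts the estimate starts at 0.23, and hiding 140 doubly heterozygous positions brings it to the target of 0.125. Because the estimate decreases as more such positions are hidden (whenever the starting value is below 0.5), rounding \(x_{11}\) up keeps the result at or below the target.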