Generalized genomic data sharing for differentially private federated learning

Md Momin Al Aziz, Md Monowar Anjum, Noman Mohammed, Xiaoqian Jiang
DOI: https://doi.org/10.1016/j.jbi.2022.104113
Abstract: The success of Machine Learning (ML) methods has largely been attributed to the quality and quantity of the available data, which can be spread across multiple owners. Federated Learning (FL) over distributed datasets often offers a reliable way to gain valuable insight from such data. Genomic data have also proven to be sensitive, so additional safety mechanisms are required before any sharing or ML operation. We propose a generalized gene expression data sharing method that uses a differentially private mechanism. Because the number of available genes is large, we also reduce the data dimension to accommodate smaller privacy budgets, and we use an exponential mechanism to create a private histogram from the numeric expression data. The output histogram can be used in any federated machine learning setting with multiple data owners. The proposed solution was submitted to the genomic data security and privacy competition iDash 2020, where it ranked third among 55 teams. We extend the proposed solution and experiment with two different machine learning algorithms and different settings. The experimental results show that training a model takes around 8 s while achieving 0.89 AUC with a privacy budget of only 5. The paper outlines a method to share gene expression data for Federated Learning using a privacy-preserving mechanism. Different experimental settings and recent competition results show the efficacy of the method, which can be further extended to other genomic datasets and machine learning algorithms.
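To make the exponential mechanism mentioned in the abstract more concrete, here is a minimal Python sketch of the textbook form of that mechanism applied to a histogram. It is not the authors' implementation. The function names (`exponential_mechanism`, `private_bin_for_gene`), the choice of bin counts as the utility score, the sensitivity of 1, and the fixed value range are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np


def exponential_mechanism(utilities, epsilon, sensitivity=1.0, rng=None):
    """Sample an index with probability proportional to
    exp(epsilon * utility / (2 * sensitivity)) (standard exponential mechanism)."""
    rng = np.random.default_rng() if rng is None else rng
    utilities = np.asarray(utilities, dtype=float)
    scores = epsilon * utilities / (2.0 * sensitivity)
    scores -= scores.max()  # shift for numerical stability; probabilities unchanged
    probs = np.exp(scores)
    probs /= probs.sum()
    return int(rng.choice(len(utilities), p=probs))


def private_bin_for_gene(values, n_bins, epsilon, value_range, rng=None):
    """Hypothetical helper: bin one gene's expression values and privately
    select a bin, using the bin counts as the utility (an assumption)."""
    counts, _ = np.histogram(values, bins=n_bins, range=value_range)
    # Adding or removing one sample changes any bin count by at most 1.
    return exponential_mechanism(counts, epsilon, sensitivity=1.0, rng=rng)


# Usage: a small epsilon gives a noisier choice; a large one favors the fullest bin.
expression = np.array([0.2] * 40 + [0.9] * 3)
chosen = private_bin_for_gene(expression, n_bins=4, epsilon=5.0, value_range=(0.0, 1.0))
```

In this sketch a smaller privacy budget flattens the sampling distribution, so the selected bin reveals less about any single sample, while a larger budget concentrates the probability on the most populated bin.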