A network-based machine-learning framework to identify both functional modules and disease genes

Kuo Yang, Kezhi Lu, Yang Wu, Jian Yu, Baoyan Liu, Yi Zhao, Jianxin Chen, Xuezhong Zhou
DOI: https://doi.org/10.1007/s00439-020-02253-0
2021-01-07
Human Genetics
Abstract: Disease gene identification is a critical step towards uncovering the molecular mechanisms of diseases and systematically investigating complex disease phenotypes. Despite considerable efforts to develop powerful computing methods, candidate gene identification remains a severe challenge owing to the connectivity of an incomplete interactome network, which hampers the discovery of true novel candidate genes. We developed a network-based machine-learning framework to identify both functional modules and disease candidate genes. In this framework, we designed a semi-supervised non-negative matrix factorization model to obtain the functional modules related to the diseases and genes. Of note, we proposed a disease gene-prioritizing method called MapGene that integrates the correlations from both functional modules and network closeness. Our framework identified a set of functional modules with highly functional homogeneity and close gene interactions. Experiments on a large-scale benchmark dataset showed that MapGene performs significantly better than the state-of-the-art algorithms. Further analysis demonstrates MapGene can effectively relieve the impact of the incompleteness of interactome networks and obtain highly reliable rankings of candidate genes. In addition, disease cases on Parkinson's disease and diabetes mellitus confirmed the generalization of MapGene for novel candidate gene identification. This work proposed, for the first time, an integrated computing framework to predict both functional modules and disease candidate genes. The methodology and results support that our framework has the potential to help discover underlying functional modules and reliable candidate genes in human disease.
genetics & heredity
What problem does this paper attempt to address?
The problem this paper attempts to solve is **how to identify disease-related genes and functional modules more accurately**. Specifically, the authors developed a network-based machine-learning framework that aims to overcome the difficulties existing methods face with incomplete interaction networks and to improve the accuracy of candidate gene prediction.

### Problem Background

1. **Importance of Disease Gene Identification**
   - Disease gene identification is crucial for revealing the molecular mechanisms of diseases and supports systematic research on complex disease phenotypes.
   - Although many powerful computational methods have been proposed, candidate gene identification remains a severe challenge because the interactome network is incomplete.
2. **Limitations of Existing Methods**
   - Existing methods such as network propagation, clustering, and classification algorithms perform well on highly connected, already known proteins but perform poorly at identifying new candidate genes.
   - Network embedding methods have made some progress, but most of them rely on the connectivity of the interaction network and on existing phenotype-genotype association data, which makes it difficult to discover truly novel genes.

### Solutions Proposed in the Paper

1. **Develop a Network-Based Machine-Learning Framework**
   - The framework combines semi-supervised non-negative matrix factorization (NMF) with network proximity evaluation to identify functional modules and disease candidate genes.
   - It proposes a disease gene prioritization method named MapGene, which integrates the relevance of functional modules with network proximity.
2. **Specific Implementation Steps**
   - **Data Collection and Pre-processing**: Collect known disease-gene associations and protein-protein interaction (PPI) data, then construct an association matrix and a PPI network.
   - **Learn the Embedded Features of Diseases and Genes**: Obtain the embedded feature matrices \( D \) (diseases) and \( G \) (genes) through the semi-supervised NMF model, and use these matrices to identify functional modules.
   - **Functional Module Verification**: Verify the functional modules using indicators such as functional homogeneity and network proximity.
   - **MapGene Algorithm**: Combine the relevance of functional modules and network proximity to predict candidate disease genes.
3. **Experimental Results**
   - Experiments on a large-scale benchmark dataset show that MapGene significantly outperforms existing state-of-the-art algorithms, especially when the K in TOP@K is small.
   - Case studies show that MapGene can effectively alleviate the impact of incomplete interaction networks and obtain reliable candidate gene rankings.

### Summary

This paper proposes a computational framework that, for the first time, predicts both functional modules and disease candidate genes, addressing the limitations of existing methods on incomplete interaction networks. The method performs well on multiple evaluation indicators, and case studies on Parkinson's disease and diabetes mellitus demonstrate its effectiveness for complex diseases.
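
The implementation steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not give the semi-supervised NMF objective or the exact MapGene scoring formula, so the code substitutes plain NMF with multiplicative updates for the factorization, uses the inverse of the shortest-path distance to the nearest known disease gene as a stand-in for network proximity, and blends the two with a hypothetical weight `alpha`. All function names, the toy matrix `A`, and the toy PPI graph `adj` are assumptions made for illustration.

```python
from collections import deque

import numpy as np


def factorize(A, k, n_iter=200, eps=1e-9, seed=0):
    """Plain NMF (stand-in for the paper's semi-supervised model).

    Approximates the disease-gene matrix A (diseases x genes) as D @ G.T,
    where D and G are non-negative embeddings with k latent modules.
    """
    rng = np.random.default_rng(seed)
    n_diseases, n_genes = A.shape
    D = rng.random((n_diseases, k))
    G = rng.random((n_genes, k))
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        D *= (A @ G) / (D @ (G.T @ G) + eps)
        G *= (A.T @ D) / (G @ (D.T @ D) + eps)
    return D, G


def bfs_distances(adj, source):
    """Shortest-path hop counts from source in an unweighted PPI graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def closeness_to_seeds(adj, gene, seeds):
    """Assumed proximity: 1 / (1 + distance to the nearest known disease gene)."""
    dist = bfs_distances(adj, gene)
    reachable = [dist[s] for s in seeds if s in dist and s != gene]
    if not reachable:
        return 0.0  # isolated from every known disease gene
    return 1.0 / (1.0 + min(reachable))


def rank_candidates(A, adj, disease, k=2, alpha=0.5):
    """Rank non-seed genes by a weighted blend of module and network scores."""
    D, G = factorize(A, k)
    module = D[disease] @ G.T            # reconstructed disease-gene relevance
    module = module / (module.max() + 1e-9)
    seeds = {g for g in range(A.shape[1]) if A[disease, g] > 0}
    scores = {}
    for g in range(A.shape[1]):
        if g in seeds:
            continue                     # only score genes not yet associated
        proximity = closeness_to_seeds(adj, g, seeds)
        scores[g] = alpha * module[g] + (1 - alpha) * proximity
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: 3 diseases x 5 genes, and a small PPI graph (gene 4 is isolated).
A = np.array([
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 0],
], dtype=float)
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: []}

if __name__ == "__main__":
    print(rank_candidates(A, adj, disease=0))
```

Setting `alpha=0.0` ranks purely by network proximity and `alpha=1.0` purely by the factorization, which makes it easy to see how each signal contributes. In this sketch a gene with no path to any known disease gene still receives a score from the module term, which mirrors the summary's point that module relevance can compensate for missing interactions.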