Abstract:

Background: Pinpointing genes involved in inherited human diseases remains a great challenge in the postgenomics era. Although approaches have been proposed that either rely on the guilt-by-association principle or make use of disease phenotype similarities, the low coverage of both diseases and genes in existing methods has prevented genome-wide scans for causative genes for a significant proportion of diseases.

Results: To overcome this limitation, we proposed a rigorous statistical method called pgFusion to prioritize candidate genes by integrating one type of disease phenotype similarity derived from the Unified Medical Language System (UMLS) and seven types of gene functional similarities calculated from gene expression, gene ontology, pathway membership, protein sequence, protein domain, protein-protein interaction and regulation pattern, respectively. Our method covered a total of 7,719 diseases and 20,327 genes, achieving the highest coverage thus far for both diseases and genes. We performed leave-one-out cross-validation experiments to demonstrate the superior performance of our method and applied it to a real exome sequencing dataset of epileptic encephalopathies, showing the capability of this approach in finding causative genes for complex diseases. We further provided the standalone software and online services of pgFusion at http://bioinfo.au.tsinghua.edu.cn/jianglab/pgfusion.

Conclusions: pgFusion not only provided an effective way of prioritizing candidate genes, but also demonstrated feasible solutions to two fundamental questions in the analysis of big genomic data: the comparability of heterogeneous data and the integration of multiple types of data. Applying this method in exome or whole-genome sequencing studies would accelerate the discovery of causative genes for human diseases. Other research fields in genomics could also benefit from incorporating our data fusion methodology.
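The leave-one-out validation mentioned in the Results can be illustrated with a minimal sketch. The abstract does not specify the fusion statistic, so the code below assumes a simple stand-in: each data source ranks the candidate genes, and the per-source rank ratios are averaged. The function names (`rank_ratios`, `fuse`, `rank_of`), the gene labels and all scores are hypothetical and are not taken from pgFusion.

```python
from statistics import mean

def rank_ratios(scores):
    """Map each gene to rank/n within one data source (best gene gets the smallest ratio)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    n = len(ordered)
    return {gene: (i + 1) / n for i, gene in enumerate(ordered)}

def fuse(sources):
    """Average rank ratios across sources. This is an assumed, illustrative
    fusion rule, not the statistic used by pgFusion."""
    per_source = [rank_ratios(s) for s in sources]
    return {g: mean(r[g] for r in per_source) for g in per_source[0]}

def rank_of(gene, fused):
    """1-based position of a gene when candidates are sorted by fused score (lower is better)."""
    return sorted(fused, key=fused.get).index(gene) + 1

# Toy leave-one-out run: G1 is a known disease gene that has been held out,
# and G2..G4 are control genes. Scores are made-up similarities between each
# candidate and the remaining known genes of the disease, one dict per source.
expression = {"G1": 0.9, "G2": 0.2, "G3": 0.5, "G4": 0.1}
ppi = {"G1": 0.7, "G2": 0.8, "G3": 0.3, "G4": 0.2}

fused = fuse([expression, ppi])
held_out_rank = rank_of("G1", fused)
```

In a full experiment this step would be repeated for every known disease-gene association, and the distribution of held-out ranks would summarize performance. Averaging rank ratios is used here only because ranks are comparable across heterogeneous sources without further calibration.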

Pathogenic Gene Prediction Algorithm Based on Heterogeneous Information Fusion.

Towards Prediction and Prioritization of Disease Genes by the Modularity of Human Phenome-Genome Assembled Network.

A Computational Method Based On The Integration Of Heterogeneous Networks For Predicting Disease-Gene Associations

Predicting disease genes based on multi-head attention fusion

Disease Gene Prediction by Integrating PPI Networks, Clinical RNA-Seq Data and OMIM Data

Multipath2vec: Predicting Pathogenic Genes Via Heterogeneous Network Embedding

A Novel Disease Gene Prediction Method Based on PPI Network

Pinpointing Disease Genes Through Phenomic and Genomic Data Fusion

A Fast and High Performance Multiple Data Integration Algorithm for Identifying Human Disease Genes

Disease Gene Prediction Based on Heterogeneous Probabilistic Hypergraph Ranking.

Prioritization of candidate disease genes by combining topological similarity and semantic similarity

Integrating Multiple Protein-Protein Interaction Networks to Prioritize Disease Genes: a Bayesian Regression Approach

Deep Collaborative Filtering for Prediction of Disease Genes

Probability-based collaborative filtering model for predicting gene–disease associations

Enhancing Cancer Driver Gene Prediction by Protein-Protein Interaction Network

Prioritizing Disease Genes by Using Search Engine Algorithm

A protein-phenotype mutual information based identification of human disease genes

Disease gene prediction with privileged information and heteroscedastic dropout

Predicting Disease Genes Based On Normalized Protein Modules And Phenotype Ontology

Prediction and Validation of Disease Genes Using HeteSim Scores

A network embedding model for pathogenic genes prediction by multi-path random walking on heterogeneous network