Prediction of the Disease Causal Genes Based on Heterogeneous Network and Multi-Feature Combination Method.
Lexiang Wang, Mingxiao Wu, Yulin Wu, Xiaofeng Zhang, Sen Li, Ming He, Fan Zhang, Yadong Wang, Junyi Li
DOI: https://doi.org/10.1016/j.compbiolchem.2022.107639
IF: 3.737
2022-01-01
Computational Biology and Chemistry
Abstract: At present, the prediction of disease causal genes is mainly based on heterogeneous networks. Research shows that heterogeneous networks contain more information and yield better prediction results. In this paper, we construct a heterogeneous network with four node types: disease, gene, phenotype, and gene ontology. On this basis, we use a machine learning algorithm to predict disease-causing genes. The algorithm is divided into three steps: preprocessing and training sample extraction, feature extraction and combination, and model training and prediction. In feature extraction and combination, network representation methods generate a representation vector for each node, which serves as that node's embedding features. We also extract the structural features of each node in the network and then combine the embedding features with the structural features. The training and prediction results show that the prediction algorithm based on all features combined achieves the best prediction performance. Moreover, combining each network representation method's embedding features with the structural features also improves performance. In training sample extraction, we propose three improvements based on the network structure and the data set distribution. First, we propose a positive sampling algorithm based on network connectivity, which tries to keep the whole heterogeneous graph connected during sampling so that the extraction of embedding features is not negatively affected. Second, we test the influence of the sampling ratio on the experimental results over the range 0-1 with a step size of 0.1. Third, we test the influence of different proportions of positive and negative samples on the results. These improvements are intended to enhance the balance and robustness of the method.
When the positive sample ratio is 0.1 and the ratio of negative to positive samples is 3, the model achieves its best result, with an AUC of 0.9887 and an accuracy of 94.55%, both significantly higher than those of the other models.
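The abstract does not spell out the sampling procedure, so the following is only a minimal sketch of one way connectivity-preserving positive sampling and ratio-based negative sampling could work. It assumes that a positive sample is a known disease-gene edge held out of the network, and that an edge may be held out only if its endpoints stay reachable through the remaining edges. The names `still_connected`, `sample_pairs`, `pos_ratio`, and `neg_per_pos` are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict, deque


def still_connected(adj, src, dst):
    """Breadth-first search: is dst reachable from src in adj?"""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in adj[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return False


def sample_pairs(edges, disease_gene_edges, pos_ratio=0.1, neg_per_pos=3, seed=0):
    """Hypothetical sampler, not the authors' implementation.

    edges: all undirected edges of the heterogeneous network.
    disease_gene_edges: the (disease, gene) subset eligible as positives.
    """
    rng = random.Random(seed)
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    target = int(pos_ratio * len(disease_gene_edges))
    candidates = list(disease_gene_edges)
    rng.shuffle(candidates)

    positives = []
    for d, g in candidates:
        if len(positives) >= target:
            break
        # Tentatively hold the edge out of the network.
        adj[d].discard(g)
        adj[g].discard(d)
        if still_connected(adj, d, g):
            positives.append((d, g))  # removal did not split the graph
        else:
            adj[d].add(g)  # the edge was a bridge, so restore it
            adj[g].add(d)

    # Negatives: random disease-gene pairs with no known association.
    known = set(disease_gene_edges)
    diseases = sorted({d for d, _ in known})
    genes = sorted({g for _, g in known})
    max_negatives = len(diseases) * len(genes) - len(known)
    wanted = min(neg_per_pos * len(positives), max_negatives)
    negatives = set()
    while len(negatives) < wanted:
        pair = (rng.choice(diseases), rng.choice(genes))
        if pair not in known:
            negatives.add(pair)
    return positives, sorted(negatives)
```

Calling `sample_pairs(edges, disease_gene_edges, pos_ratio=0.1, neg_per_pos=3)` would mirror the best-performing setting reported above. The per-edge reachability check is chosen for clarity and may be too slow for a large network, where a bridge-finding algorithm would be a more practical choice.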