DeepnsSNPs: Accurate Prediction of Non-Synonymous Single-Nucleotide Polymorphisms by Combining Multi-Scale Convolutional Neural Network and Residue Environment Information

Fang Ge, Arif Muhammad, Dong-Jun Yu
DOI: https://doi.org/10.1016/j.chemolab.2021.104326
IF: 4.175
2021-01-01
Chemometrics and Intelligent Laboratory Systems
Abstract: Non-synonymous single-nucleotide polymorphisms (nsSNPs) are a common type of genetic variant, and more than 6000 diseases have been reported to be caused by nsSNPs. Accordingly, accurate prediction of nsSNPs is of great importance for a better understanding of their functional mechanisms and for disease treatment. Many computational methods have been developed to distinguish disease-causing nsSNPs from neutral ones; however, there is still room for improvement in overall prediction performance. In this work, we propose a novel deep learning model, called the multi-scale convolutional neural network (MSCNN). It uses multi-scale convolution with different kernel sizes for feature processing, which can capture more effective characteristics than a single convolution kernel size. Moreover, we apply three types of nominal structural features to further improve nsSNP prediction performance. Notably, the nsSNP sequence and structural features are extracted with the "residue environment" method we proposed, which proved effective for protein nsSNP prediction in our previous research. Based on the proposed MSCNN model and the extracted feature matrix, we implement a new nsSNP predictor, named DeepnsSNPs. DeepnsSNPs was tested on three nsSNP datasets collected from the PredictSNP1 website and achieved an average Matthews correlation coefficient of 0.507, which is 18.28% higher than the individual classifiers and 11.37% higher than the consensus classifier on average. Detailed dataset analyses demonstrate that DeepnsSNPs is useful for nsSNP prediction. We provide the Python source code and benchmark datasets at https://github.com/sera616/DeepnsSNPs.git for academic use.
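To make the multi-scale idea concrete, the following is a minimal sketch of convolving one input with several kernel sizes in parallel and stacking the results per position. It is an illustration only, not the authors' MSCNN: the function names `conv1d_same` and `multi_scale_features`, the kernel sizes (3, 5, 7), and the fixed averaging kernels standing in for learned weights are all assumptions made for this example.

```python
def conv1d_same(signal, kernel):
    """Slide an odd-length kernel over the signal with zero padding,
    so the output has the same length as the input."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(padded[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal))
    ]


def multi_scale_features(signal, kernel_sizes=(3, 5, 7)):
    """Run one convolution branch per kernel size and stack the branch
    outputs, giving one feature vector per input position.

    Assumption: averaging kernels are placeholders; a trained network
    would learn the kernel weights instead.
    """
    branches = [conv1d_same(signal, [1.0 / k] * k) for k in kernel_sizes]
    return [list(column) for column in zip(*branches)]


# A single spike in the middle of a length-7 signal.
signal = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
features = multi_scale_features(signal)
print(features[0])  # only the widest kernel reaches the spike from position 0
print(features[3])  # every kernel covers the spike at its own position
```

At position 0 only the size-7 branch responds to the spike, while at position 3 all three branches do. This is the intuition behind combining kernel sizes: wider kernels bring in more distant context that a single narrow kernel would miss.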