Abstract: Each human genome contains more than ten thousand coding variants, yet our knowledge of how genetic variants underlie phenotypic differences is far from complete. Small insertions and deletions (indels) are among the most common types of human genetic variants and play a significant role in human inherited disease, but we still lack a comprehensive understanding of how indels cause disease. Identifying and analyzing such deleterious variants is therefore a key challenge and a topic of great interest in current genome biology research. A growing number of computational methods have been developed to discriminate between deleterious and neutral indels. However, most existing methods rely on traditional sequential or structural features, which cannot fully explain the association between indels and the inherited diseases they induce. In this study, we establish a novel method to predict deleterious non-frameshifting indels based on features extracted from both protein interaction networks and traditional hybrid properties. Each indel was encoded by 1,246 features. Using the maximum relevance minimum redundancy (mRMR) method and the incremental feature selection (IFS) method, we obtained an optimal feature set containing 42 features, of which 21 were derived from protein interaction networks. Based on the optimal feature set, a Random Forest achieved an accuracy of 88% and a Matthews correlation coefficient (MCC) of 0.76, as evaluated by the jackknife cross-validation test. This method outperformed existing methods for predicting deleterious indels and can be applied in practice to predict deleterious non-frameshifting indels in genome research. Analysis of the optimal features revealed that network interactions play a more important role, and may be more informative for illustrating an indel’s function and disease associations, than traditional sequential or structural features. These results could shed some light on the genetic basis of human genetic variation and human inherited diseases.
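
The evaluation loop described in the abstract (a ranked feature list, incremental feature selection, jackknife testing, and MCC scoring) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy data, and the feature ranking are hypothetical, the ranking is assumed to come from a prior mRMR step that is not shown, and a 1-nearest-neighbor classifier stands in for the Random Forest so that the example needs only the Python standard library.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def nearest_neighbor_predict(train_X, train_y, x):
    """Stand-in classifier (1-NN). The study itself used a Random Forest."""
    best = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[best]


def jackknife_evaluate(X, y, predict=nearest_neighbor_predict):
    """Jackknife (leave-one-out) test: predict each sample from all the others."""
    tp = tn = fp = fn = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        pred = predict(train_X, train_y, X[i])
        if pred == 1 and y[i] == 1:
            tp += 1
        elif pred == 0 and y[i] == 0:
            tn += 1
        elif pred == 1 and y[i] == 0:
            fp += 1
        else:
            fn += 1
    accuracy = (tp + tn) / len(X)
    return accuracy, mcc(tp, tn, fp, fn)


def incremental_feature_selection(X, y, ranked_features):
    """IFS: evaluate growing prefixes of the ranked list and keep the best by MCC.

    Ties are resolved in favor of the smaller feature set.
    """
    best_k, best_score = 0, -2.0  # MCC is always >= -1
    for k in range(1, len(ranked_features) + 1):
        cols = ranked_features[:k]
        X_k = [[row[c] for c in cols] for row in X]
        _, score = jackknife_evaluate(X_k, y)
        if score > best_score:
            best_k, best_score = k, score
    return ranked_features[:best_k], best_score


# Toy data (hypothetical): feature 0 separates the classes, features 1 and 2 are noise.
X = [
    [0.10, 5, 0], [0.20, 1, 9], [0.15, 9, 4],  # label 0 (neutral)
    [0.90, 4, 1], [1.00, 8, 8], [0.95, 2, 5],  # label 1 (deleterious)
]
y = [0, 0, 0, 1, 1, 1]
ranking = [0, 2, 1]  # assumed output of an mRMR ranking step

selected, score = incremental_feature_selection(X, y, ranking)
print(selected, score)
```

In a real setting the 1-NN stand-in would be replaced by a Random Forest, and the ranking would be computed over all 1,246 features rather than supplied by hand.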

Combination Use of Protein–protein Interaction Network Topological Features Improves the Predictive Scores of Deleterious Non-Synonymous Single-Nucleotide Polymorphisms

Protein-protein Interaction Network with Machine Learning Models and Multiomics Data Reveal Potential Neurodegenerative Disease-Related Proteins

Prediction of Deleterious Non-Synonymous SNPs Based on Protein Interaction Network and Hybrid Properties

Predicting Disease-Associated Substitution of a Single Amino Acid by Analyzing Residue Interactions

Predicting Pathogenic Single Nucleotide Variants Through a Comprehensive Analysis on Multiple Level Features

A Novel Predictor for Disease-Genes Based on Combination Use of Topological Features in Human Protein-Protein Interaction Network

Accurate Sequence-Based Prediction of Deleterious nsSNPs with Multiple Sequence Profiles and Putative Binding Residues

NIPS, a 3D Network-Integrated Predictor of Deleterious Protein SAPs, and Its Application in Cancer Prognosis.

Discriminating Between Deleterious and Neutral Non-Frameshifting Indels Based on Protein Interaction Networks and Hybrid Properties

Enhancing Cancer Driver Gene Prediction by Protein-Protein Interaction Network

Inferring Non-Synonymous Single-Nucleotide Polymorphisms-Disease Associations Via Integration of Multiple Similarity Networks

Prediction of Deleterious Nonsynonymous Single-Nucleotide Polymorphism for Human Diseases

DAMpred: Recognizing Disease-Associated nsSNPs Through Bayes-Guided Neural-Network Model Built on Low-Resolution Structure Prediction of Proteins and Protein-Protein Interactions.

Predicting Deleterious Non-Synonymous Single Nucleotide Polymorphisms in Signal Peptides Based on Hybrid Sequence Attributes

Improved Feature-Based Prediction of SNPs in Human Cytochrome P450 Enzymes.

Predicting Disease Genes Based On Normalized Protein Modules And Phenotype Ontology

Prediction of Disease-Associated nsSNPs by Integrating Multi-Scale ResNet Models with Deep Feature Fusion.

Computational identification of deleterious synonymous variants in human genomes using a feature-based approach

A Computational Method Based On The Integration Of Heterogeneous Networks For Predicting Disease-Gene Associations

Partition Dataset According to Amino Acid Type Improves the Prediction of Deleterious Non-Synonymous SNPs