DDIG-in: Detecting Disease-Causing Genetic Variations Due to Frameshifting Indels and Nonsense Mutations Employing Sequence and Structural Properties at Nucleotide and Protein Levels.

Lukas Folkman, Yuedong Yang, Zhixiu Li, Bela Stantic, Abdul Sattar, Matthew Mort, David N. Cooper, Yunlong Liu, Yaoqi Zhou
DOI: https://doi.org/10.1093/bioinformatics/btu862
IF: 5.8
2015-01-01
Bioinformatics
Abstract:
MOTIVATION: Frameshifting (FS) indels and nonsense (NS) variants disrupt the protein-coding sequence downstream of the mutation site by changing the reading frame or introducing a premature termination codon, respectively. Despite such drastic changes to the protein sequence, FS indels and NS variants have been discovered in healthy individuals. How to discriminate disease-causing from neutral FS indels and NS variants is an understudied problem.
RESULTS: We have built a machine learning method called DDIG-in (FS) based on real human genetic variations from the Human Gene Mutation Database (inherited disease-causing) and the 1000 Genomes Project (GP) (putatively neutral). The method incorporates both sequence and predicted structural features and yields a robust performance by 10-fold cross-validation and independent tests on both FS indels and NS variants. We showed that human-derived NS variants and FS indels derived from animal orthologs can be effectively employed for independent testing of our method trained on human-derived FS indels. DDIG-in (FS) achieves a Matthews correlation coefficient (MCC) of 0.59, a sensitivity of 86%, and a specificity of 72% for FS indels. Application of DDIG-in (FS) to NS variants yields essentially the same performance (MCC of 0.43) as a method that was specifically trained for NS variants. DDIG-in (FS) was shown to make a significant improvement over existing techniques.
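For readers unfamiliar with the three metrics quoted in the abstract, the following is a minimal sketch of how sensitivity, specificity, and MCC are computed from confusion-matrix counts. It is not the authors' evaluation code. The function names and the counts are hypothetical: the counts assume a balanced set of 100 disease-causing and 100 neutral variants, chosen only so that sensitivity and specificity match the reported 86% and 72%. The actual dataset sizes are not given in this abstract.

```python
import math


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of disease-causing variants that are predicted as disease-causing."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of neutral variants that are predicted as neutral."""
    return tn / (tn + fp)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts (assumption): 100 positives and 100 negatives.
tp, fn = 86, 14   # disease-causing variants: correctly / incorrectly classified
tn, fp = 72, 28   # neutral variants: correctly / incorrectly classified

print(f"sensitivity = {sensitivity(tp, fn):.2f}")
print(f"specificity = {specificity(tn, fp):.2f}")
print(f"MCC         = {mcc(tp, tn, fp, fn):.2f}")
```

Under this balanced-class assumption the MCC rounds to 0.59, which agrees with the figure quoted for FS indels. With unbalanced classes the same sensitivity and specificity would give a different MCC, so the agreement should not be read as a statement about the real class sizes.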