Abstract

Background: Although synonymous single nucleotide variants (sSNVs) do not alter protein sequences, they have been shown to play an important role in human disease. Distinguishing pathogenic sSNVs from neutral ones is challenging because pathogenic sSNVs tend to have low prevalence. Many methods have been developed for predicting the functional impact of single nucleotide variants, but only a few have been specifically designed for identifying pathogenic sSNVs.

Results: In this work, we describe a computational model, IDSV (Identification of Deleterious Synonymous Variants), which uses a random forest (RF) to detect deleterious sSNVs in human genomes. We systematically investigate a total of 74 multifaceted features across seven categories: splicing, conservation, codon usage, sequence, pre-mRNA folding energy, translation efficiency, and functional region annotation. To remove redundant and irrelevant features and improve prediction performance, we then apply feature selection using the sequential backward selection method. An RF classifier built on the 10 selected features is used to identify deleterious sSNVs. Results on benchmark datasets show that IDSV outperforms other state-of-the-art methods in identifying pathogenic sSNVs.

Conclusions: We have developed an efficient feature-based prediction approach (IDSV) for deleterious sSNVs that draws on a wide variety of features. From all the features, we identify a compact and useful subset with important implications for identifying deleterious sSNVs. Our results indicate that, besides splicing and conservation features, a new translation efficiency feature is also informative for identifying deleterious sSNVs. Although the functional region annotation and sequence features are only weakly informative on their own, they may help discriminate deleterious sSNVs from benign ones when combined with other features.
The data and source code are available at http://bioinfo.ahu.edu.cn:8080/IDSV.
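The sequential backward selection step mentioned in the abstract can be illustrated with a minimal sketch. This is not the IDSV implementation: the feature names, the weights in `TOY_WEIGHTS`, and the additive `toy_score` function are hypothetical stand-ins, chosen so the example runs with the standard library alone. In a real pipeline, the score would presumably be a cross-validated performance measure of a random forest trained on the candidate feature subset.

```python
from typing import Callable, List, Sequence, Tuple


def sequential_backward_selection(
    features: Sequence[str],
    score: Callable[[Tuple[str, ...]], float],
    n_keep: int,
) -> List[str]:
    """Greedily drop one feature at a time until n_keep remain.

    At each step, every remaining feature is tentatively removed, the
    reduced subset is scored, and the feature whose removal leaves the
    best-scoring subset is discarded for good.
    """
    if not 1 <= n_keep <= len(features):
        raise ValueError("n_keep must be between 1 and len(features)")
    selected = list(features)
    while len(selected) > n_keep:
        candidates = []
        for feature in selected:
            subset = tuple(f for f in selected if f != feature)
            candidates.append((score(subset), feature))
        # Drop the feature whose absence gives the highest score.
        _, to_drop = max(candidates, key=lambda c: c[0])
        selected.remove(to_drop)
    return selected


# Hypothetical per-feature usefulness, for illustration only.
TOY_WEIGHTS = {
    "splicing": 0.9,
    "conservation": 0.8,
    "translation_efficiency": 0.6,
    "sequence": 0.2,
    "noise_a": -0.1,
    "noise_b": -0.3,
}


def toy_score(subset: Tuple[str, ...]) -> float:
    """Stand-in for a classifier's validation score on a feature subset."""
    return sum(TOY_WEIGHTS[f] for f in subset)


kept = sequential_backward_selection(list(TOY_WEIGHTS), toy_score, n_keep=3)
print(kept)
```

Passing the scorer in as a callable keeps the selection loop independent of the classifier, so the toy scorer could be swapped for a model-based one without changing the loop. Each elimination round costs one score evaluation per remaining feature, which matters when every evaluation means retraining a model.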
