Identification of 5' UTR Splicing Site Using Sequence and Structural Specificities Based on Combination Statistical Method with SVM

Lv Jun-Jie, Wang Ke-Jun, Feng Wei-Xing, Wang Xin, Xiong Xin-Yan
DOI: https://doi.org/10.12785/amis/071l14
2013-01-01
Applied Mathematics & Information Sciences
Abstract: To identify splice sites in untranslated regions (UTRs) more accurately and efficiently, a recognition method is proposed that uses both the splicing sequences and the secondary structures of their flanking sequences, combining a statistical method with a support vector machine (SVM). The method consists of two stages: a statistical method in the first stage and an SVM with a polynomial kernel in the second. The statistical method serves as a pre-processing step for the SVM and takes UTR sequences as its input. It models the compositional features and dependencies of nucleotides around splice-site regions in terms of probabilistic parameters. These parameters are then fed into the SVM, which combines them nonlinearly to predict splice sites. To incorporate structure, the Mfold package in the Vienna software was used to predict the most stable secondary structure of the flanking sequences, and the traditional four-letter alphabet was converted into an eight-letter alphabet. The resulting sequence-structure combination strings were used to train the models, and the trained models were then used to recognize splice sites. Tested on an actual dataset of 5' UTR splice sites from human genes, the method shows good performance in recognizing UTR splice sites.
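The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not define the eight-letter alphabet or the exact probabilistic parameters, so everything below is an assumption made for illustration: the structure is taken to be in dot-bracket notation, a paired base is written in lowercase and an unpaired base in uppercase (giving eight symbols), and the statistical stage is approximated by position-specific log-odds scores. The function names `encode_eight_letter`, `position_probabilities`, and `log_odds_features` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter

# Assumed eight-letter alphabet: uppercase = unpaired base, lowercase = paired base.
ALPHABET = "ACGUacgu"


def encode_eight_letter(seq, structure):
    """Merge a nucleotide sequence and its dot-bracket structure into one string."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    out = []
    for base, mark in zip(seq.upper().replace("T", "U"), structure):
        if base not in "ACGU":
            raise ValueError(f"unexpected nucleotide: {base!r}")
        if mark not in ".()":
            raise ValueError(f"unexpected structure symbol: {mark!r}")
        # '.' means unpaired; '(' or ')' means the base is in a pair.
        out.append(base if mark == "." else base.lower())
    return "".join(out)


def position_probabilities(strings, pseudocount=1.0):
    """Estimate per-position symbol probabilities from aligned, equal-length strings."""
    length = len(strings[0])
    total = len(strings) + pseudocount * len(ALPHABET)
    table = []
    for i in range(length):
        counts = Counter(s[i] for s in strings)
        table.append({a: (counts[a] + pseudocount) / total for a in ALPHABET})
    return table


def log_odds_features(s, pos_table, neg_table):
    """Score each position by log(P(symbol | site) / P(symbol | non-site))."""
    return [math.log(pos_table[i][c] / neg_table[i][c]) for i, c in enumerate(s)]
```

In this sketch, the feature vectors returned by `log_odds_features` would play the role of the probabilistic parameters passed to the second stage, for example a polynomial-kernel classifier such as scikit-learn's `SVC(kernel="poly")`. The kernel degree and other hyperparameters are not given in the abstract.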