In Silico Prediction of mRNA Poly(A) Sites in Chlamydomonas reinhardtii

Xiaohui Wu, Guoli Ji, Yong Zeng
DOI: https://doi.org/10.1007/s00438-012-0725-5
IF: 2.98
2012-01-01
Molecular Genetics and Genomics
Abstract: Accurately predicting polyadenylation [poly(A)] sites is important for defining the ends of genes and understanding gene regulation mechanisms. Alternative polyadenylation (APA) has been shown to play an important role in transcriptome diversity and in regulating gene expression. To accurately predict poly(A) and APA sites in Chlamydomonas reinhardtii, a green alga that can be used to produce renewable energy, we proposed a novel model that integrated five methods for representing the features of these sites with a combined classifier. We presented a new grouping method based on pattern assembly to classify the poly(A) sites into four groups. We used five methods, involving the predicted RNA secondary structure, the term frequency–inverse document frequency weight, a first-order Markov chain, the pentamer ratio and a position weight matrix, to generate the feature space. We then developed a heuristic method to form the combined classifier by weighting multiple classifiers to predict poly(A) sites in each group. The high specificity and sensitivity of this model were demonstrated by testing the four groups of poly(A) sites and the intronic APA sites. The average prediction performance was approximately 8% higher than that of a previous prediction model. For the group without any conserved patterns, the prediction accuracy was 9% higher than with the previous technique. However, the prediction efficiency for this group was still significantly lower than that for the other groups, indicating the importance of identifying additional signal patterns for poly(A) site prediction. We also predicted the alternative poly(A) sites in introns with good accuracy. This prediction model was designed to be easily expanded with new classifiers or new features; it is therefore applicable to new data or other species. Our model will be useful both in genome annotation, because it predicts the end of a mature transcript, and in genetic engineering, because it enables researchers to eliminate undesirable poly(A) sites.
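
To make two of the ingredients named in the abstract concrete, here is a minimal Python sketch of a position weight matrix score and a weighted combination of classifier outputs. It is an illustration under stated assumptions, not the authors' implementation: the function names, the uniform background, the pseudocount, the weighted-average rule and the toy sequences are all hypothetical, and the abstract does not say how the heuristic weights are actually chosen.

```python
import math
from collections import Counter

BASES = "ACGT"

def build_pwm(sequences, pseudocount=1.0):
    """Build a log-odds position weight matrix from equal-length sequences.

    Assumes a uniform background (0.25 per base); this is an illustrative
    choice, not one taken from the paper.
    """
    length = len(sequences[0])
    total = len(sequences) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in sequences)
        pwm.append({
            base: math.log2(((counts[base] + pseudocount) / total) / 0.25)
            for base in BASES
        })
    return pwm

def pwm_score(pwm, window):
    """Sum the per-position log-odds scores for one sequence window."""
    return sum(column[base] for column, base in zip(pwm, window))

def combine_scores(scores, weights):
    """Weighted average of per-classifier scores (hypothetical combination rule)."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

# Toy training windows, made up for illustration only.
toy_windows = ["TGTAA", "TGTAA", "TGTAT"]
pwm = build_pwm(toy_windows)

matching = pwm_score(pwm, "TGTAA")      # resembles the training windows
non_matching = pwm_score(pwm, "ACGCC")  # shares no position with them
combined = combine_scores([1.0, 0.0], [3.0, 1.0])
```

In this sketch a window that resembles the training set scores higher than one that does not, and a classifier with a larger weight contributes more to the combined score. A real pipeline would derive each classifier's weight from its validation performance within each poly(A) site group.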