Chapter 2 Prediction of Plant mRNA Polyadenylation Sites

Xiaohui Wu, Guoli Ji, Qingshun Quinn Li
2018-01-01
Abstract: Messenger RNA polyadenylation is one of the essential processing steps during eukaryotic gene expression. The site of polyadenylation [poly(A) site] marks the end of a transcript, which is also the end of a gene in most cases. A computational program that can recognize poly(A) sites would be useful not only for genome annotation, in finding gene ends, but also for predicting alternative poly(A) sites. PASS [Poly(A) Site Sleuth] and PAC [Poly(A) site Classifier] were developed to predict poly(A) sites in plants. PASS is built on the Generalized Hidden Markov Model (GHMM) and consists of four functional modules: an input module, a poly(A) site recognition module, a graphic process module, and an output module. PAC is a classification model that integrates several features defining poly(A) sites, including K-gram pattern, Z-curve, position-specific scoring matrix, and a first-order inhomogeneous Markov sub-model. PAC can be used to predict poly(A) sites in species whose polyadenylation profile is unknown. PASS and PAC each output several files, one of which contains the score or probability of being a poly(A) site for each position of a given sequence. Although the models were built mostly on poly(A) profile data from Arabidopsis, they are also functional in other higher plants, since their profiles are quite similar.
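To make two of the sequence features named in the abstract concrete, the sketch below computes overlapping K-gram counts and the standard Z-curve coordinates for a short nucleotide sequence. It is only an illustration: the abstract does not give PAC's exact feature definitions, window sizes, or normalization, so the function names `kgram_counts` and `z_curve` and the example sequence are assumptions, and the Z-curve shown is the general textbook definition rather than PAC's specific variant.

```python
from collections import Counter


def kgram_counts(seq, k):
    """Count overlapping substrings of length k (K-grams) in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def z_curve(seq):
    """Return the cumulative Z-curve point (x, y, z) after each base.

    x: purine (A, G) minus pyrimidine (C, T) counts
    y: amino (A, C) minus keto (G, T) counts
    z: weak H-bond (A, T) minus strong H-bond (G, C) counts
    Bases other than A, C, G, T/U leave the point unchanged.
    """
    x = y = z = 0
    points = []
    for base in seq.upper():
        if base == "A":
            x, y, z = x + 1, y + 1, z + 1
        elif base == "C":
            x, y, z = x - 1, y + 1, z - 1
        elif base == "G":
            x, y, z = x + 1, y - 1, z - 1
        elif base in ("T", "U"):
            x, y, z = x - 1, y - 1, z + 1
        points.append((x, y, z))
    return points


if __name__ == "__main__":
    example = "AATAAA"  # hypothetical input, chosen only for illustration
    print(kgram_counts(example, 2))
    print(z_curve(example)[-1])
```

In a classifier along these lines, such values would typically be computed over windows around each candidate position and combined with the other features into one feature vector; how PAC actually does this is not stated in the abstract.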