Discriminate the falsely predicted protein-coding genes in Aeropyrum Pernix K1 genome based on graphical representation

Jiafeng Yu, Dongke Jiang, Ke Xiao, Yun Jin, Jihua Wang, Xiao Sun
2012-01-01
Abstract: The question of how many protein-coding genes exist in the Aeropyrum pernix K1 genome has puzzled researchers since 1999. In this paper, we re-identify the protein-coding genes in this genome using a modified method based on the I-TN curve. In the self-test, all 727 experimentally validated protein-coding genes and 726 of the corresponding negative samples are correctly predicted, giving an accuracy of 99.93%. In the jackknife test, two positive samples and two negative samples are falsely predicted, giving a cross-validation accuracy of 99.72%. In the testing set, all 132 putative genes are correctly predicted as protein-coding and 14 of the 841 hypothetical genes are predicted as non-coding, which reduces the number of protein-coding genes from 1700 to 1686. Further analysis shows that the performance of the reannotation algorithm is comparable to that of other prevalent programs, while the present method is much simpler and more efficient. We also apply the reannotation algorithm, trained on Aeropyrum pernix K1, to the Chlorobium tepidum TLS genome, where 217 hypothetical genes are predicted as non-coding. Detailed sequence analysis indicates that most of them are random sequences that were falsely predicted as protein-coding genes. In addition, we analyze the influence of artificial parameters on graphical representation approaches, which may provide helpful information for related research.
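The reported accuracies and gene counts can be checked with simple arithmetic. The sketch below is illustrative only: it assumes the negative set contains 727 samples (one per positive sample, which the abstract implies but does not state outright), and the helper name `accuracy` is hypothetical.

```python
# Counts taken from the abstract.
N_POS = 727          # experimentally validated protein-coding genes
N_NEG = 727          # ASSUMPTION: one negative sample per positive sample
N_PUTATIVE = 132     # putative genes in the testing set
N_HYPOTHETICAL = 841 # hypothetical genes in the testing set


def accuracy(n_errors: int, n_total: int) -> float:
    """Percentage of correctly predicted samples."""
    return 100.0 * (n_total - n_errors) / n_total


n_total = N_POS + N_NEG

# Self-test: one negative sample is misclassified (726 of 727 correct).
self_test = accuracy(1, n_total)

# Jackknife test: two positives and two negatives are misclassified.
jackknife = accuracy(2 + 2, n_total)

# Annotated genes before and after removing the 14 predicted non-coding genes.
annotated = N_POS + N_PUTATIVE + N_HYPOTHETICAL
remaining = annotated - 14

print(f"self-test accuracy: {self_test:.2f}%")
print(f"jackknife accuracy: {jackknife:.2f}%")
print(f"annotated genes: {annotated}, remaining after reannotation: {remaining}")
```

Under that assumption, the computed values agree with the 99.93% and 99.72% accuracies and the 1700 and 1686 gene counts given in the abstract.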