GeneMarkS-2: Raising Standards of Accuracy in Gene Recognition

A. Lomsadze, M. Borodovsky, Karl Gemayel, Shiyuyun Tang
Abstract: Motivation: Ab initio gene prediction in prokaryotic genomes is supposed to be so accurate that RNA-Seq data are rarely produced to bring in an additional layer of evidence. In 2016 more than 60,000 prokaryotic genomes were re-annotated by the NCBI pipeline. Given the sheer volume of prokaryotic DNA data flowing from next-generation sequencing facilities into genome databases, annotation accuracy should be at the highest level possible. Still, the prevalence of horizontal gene transfer and the ubiquity of leaderless transcription observed in prokaryotic species call for more complex models of genes and regulatory regions than were previously thought sufficient. Results: We describe a new algorithm and software tool, GeneMarkS-2. The new multi-model tool has an option to select the parameters that best match local genomic GC content, which may vary widely due to horizontal gene transfer. Genomes are automatically classified by the inferred type of organization of gene start neighborhoods, whose evolution is directed by species-specific transcription and translation mechanisms. A new motif search algorithm, LFinder, was introduced to reach higher accuracy in detecting conserved motifs in regulatory regions upstream of predicted gene starts; it uses an objective function that depends on motif localization. In performance assessments on test sets validated by proteomics experiments and other sources of evidence, we have demonstrated superior accuracy of GeneMarkS-2 in comparison with other state-of-the-art gene prediction tools, including GeneMarkS, whose “plus” version is currently used by the NCBI prokaryotic genome annotation pipeline. Availability: http://topaz.gatech.edu/GeneMark/genemarks2.cgi Contact: borodovsky@gatech.edu
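The abstract mentions selecting parameters that best match local genomic GC content. The sketch below illustrates only that general idea and is not the GeneMarkS-2 implementation: the bin edges, the model names, and the helper functions `gc_content`, `pick_model`, and `windows` are all hypothetical and chosen for illustration.

```python
from bisect import bisect_right

def gc_content(seq: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return gc / total if total else 0.0

# Hypothetical bin edges and model names (assumptions, not from the paper).
GC_BIN_EDGES = [0.40, 0.55]
MODEL_NAMES = ["low_gc", "mid_gc", "high_gc"]

def pick_model(seq: str) -> str:
    """Return the name of the model whose GC bin contains this sequence."""
    return MODEL_NAMES[bisect_right(GC_BIN_EDGES, gc_content(seq))]

def windows(genome: str, size: int, step: int):
    """Yield (start, subsequence) pairs for a sliding window over the genome."""
    for start in range(0, max(len(genome) - size, 0) + 1, step):
        yield start, genome[start:start + size]

if __name__ == "__main__":
    # Toy sequence: an AT-rich stretch followed by a GC-rich stretch.
    genome = "ATATATATAT" + "GCGCGGCCGC"
    for start, window in windows(genome, size=10, step=10):
        print(start, round(gc_content(window), 2), pick_model(window))
```

In a real tool the per-bin parameters would be trained models rather than labels; the point here is only that a region acquired by horizontal transfer can fall into a different GC bin than the rest of the genome and so be scored with different parameters.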
Biology, Computer Science