Building better genome annotations across the tree of life

Adam H Freedman, Timothy B Sackton
DOI: https://doi.org/10.1101/2024.04.12.589245
2024-05-21
Abstract: Recent technological advances in long read DNA sequencing accompanied by dramatic reduction in costs have made the production of genome assemblies financially achievable and computationally feasible, such that genome assembly no longer represents the major hurdle to evolutionary analysis for most non-model organisms. Now, the more difficult challenge is to properly annotate a draft genome assembly once it has been constructed. The primary challenge to annotation is how to select from the myriad gene prediction tools that are currently available, determine what kinds of data are necessary to generate high quality annotations, and evaluate the quality of the annotation. To determine which methods perform the best and determine whether the inclusion of RNA-seq data is necessary to obtain a high-quality annotation, we generated annotations with 10 different methods for 21 different species spanning vertebrates, plants, and insects. We found that the RNA-seq assembler Stringtie and the annotation transfer method TOGA were consistently top performers across a variety of metrics including BUSCO recovery, CDS length, and false positive rate, with the exception that TOGA performed less well in plants with larger genomes. RNA-seq alignment rate was best with RNA-seq assemblers. HMM-based methods such as BRAKER, MAKER, and multi-genome AUGUSTUS mostly underperformed relative to Stringtie and TOGA. In general, inclusion of RNA-seq data will lead to substantial improvements to genome annotations, and there may be cases where complementarity among methods may motivate combining annotations from multiple sources.
Biology
What problem does this paper attempt to address?
This paper asks how to annotate genes effectively once a genome has been assembled. Advances in long-read DNA sequencing and a sharp drop in cost mean that genome assembly is no longer the main obstacle for most non-model organisms; correctly annotating a draft assembly is now the harder challenge. Specifically, the paper focuses on the following aspects:

1. **Selecting appropriate gene prediction tools**: Many gene prediction tools are currently available, and choosing the most suitable one is difficult.
2. **Determining the data types required to generate high-quality annotations**: Beyond the genomic sequence itself, are other experimental data such as RNA-seq needed to improve annotation quality?
3. **Evaluating annotation quality**: How should annotations generated by different methods be evaluated, including the accuracy, completeness, and specificity of gene prediction?

To answer these questions, the authors generated annotations with 10 different methods for 21 species (spanning vertebrates, plants, and insects) and evaluated them with multiple metrics, such as BUSCO recovery, CDS length, and false positive rate. The main goal of the study was to determine which methods perform best across a range of sensitivity and specificity metrics, and whether RNA-seq data is crucial for generating high-quality annotations.
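To make the evaluation metrics named above more concrete, here is a minimal Python sketch that summarizes one annotation by BUSCO recovery, median CDS length, and a false-positive proxy. It is an illustration under stated assumptions, not the authors' pipeline: the function `summarize_annotation`, its parameters, and in particular the definition of the false-positive proxy (the fraction of predicted genes with no supporting evidence) are hypothetical and may differ from how the paper computes these quantities.

```python
from statistics import median


def summarize_annotation(busco_complete, busco_total, cds_lengths,
                         predicted_genes, supported_genes):
    """Summarize one annotation with three simple indicators.

    All inputs are assumed to be precomputed elsewhere (hypothetical):
      busco_complete  -- number of BUSCO genes recovered as complete
      busco_total     -- number of BUSCO genes searched
      cds_lengths     -- list of CDS lengths (bp) for predicted genes
      predicted_genes -- set of predicted gene IDs
      supported_genes -- set of gene IDs with supporting evidence
    """
    if busco_total <= 0:
        raise ValueError("busco_total must be positive")
    if not predicted_genes:
        raise ValueError("predicted_genes must not be empty")

    # Predicted genes without supporting evidence are treated as
    # likely false positives; this is an assumed proxy, not the
    # paper's definition.
    unsupported = set(predicted_genes) - set(supported_genes)

    return {
        "busco_recovery": busco_complete / busco_total,
        "median_cds_length": median(cds_lengths),
        "unsupported_fraction": len(unsupported) / len(predicted_genes),
    }


summary = summarize_annotation(
    busco_complete=90,
    busco_total=100,
    cds_lengths=[300, 900, 1200],
    predicted_genes={"g1", "g2", "g3", "g4"},
    supported_genes={"g1", "g2", "g3"},
)
print(summary)
```

Computing the same summary for each method and species would give a table in which methods can be compared on both sensitivity (BUSCO recovery) and specificity (the unsupported fraction), which is the kind of comparison the study describes.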