Building better genome annotations across the tree of life

Adam H Freedman,Timothy B Sackton

DOI: https://doi.org/10.1101/2024.04.12.589245

2024-05-21

Abstract: Recent technological advances in long read DNA sequencing accompanied by dramatic reduction in costs have made the production of genome assemblies financially achievable and computationally feasible, such that genome assembly no longer represents the major hurdle to evolutionary analysis for most non-model organisms. Now, the more difficult challenge is to properly annotate a draft genome assembly once it has been constructed. The primary challenge to annotation is how to select from the myriad gene prediction tools that are currently available, determine what kinds of data are necessary to generate high quality annotations, and evaluate the quality of the annotation. To determine which methods perform the best and determine whether the inclusion of RNA-seq data is necessary to obtain a high-quality annotation, we generated annotations with 10 different methods for 21 different species spanning vertebrates, plants, and insects. We found that the RNA-seq assembler Stringtie and the annotation transfer method TOGA were consistently top performers across a variety of metrics including BUSCO recovery, CDS length, and false positive rate, with the exception that TOGA performed less well in plants with larger genomes. RNA-seq alignment rate was best with RNA-seq assemblers. HMM-based methods such as BRAKER, MAKER, and multi-genome AUGUSTUS mostly underperformed relative to Stringtie and TOGA. In general, inclusion of RNA-seq data will lead to substantial improvements to genome annotations, and there may be cases where complementarity among methods may motivate combining annotations from multiple sources.

Biology

What problem does this paper attempt to address?

This paper addresses how to annotate genes effectively once a genome assembly is complete. With advances in long-read DNA sequencing and a sharp drop in cost, genome assembly is no longer the main obstacle for most non-model organisms. Correctly annotating a draft genome assembly has become the harder challenge. Specifically, the paper focuses on the following aspects:

1. **Selecting appropriate gene prediction tools**: Many gene prediction tools are currently available, and choosing the most suitable one is difficult.
2. **Determining the data types required to generate high-quality annotations**: Besides genomic sequences, are other types of experimental data, such as RNA-seq data, needed to improve annotation quality?
3. **Evaluating annotation quality**: How should the annotations generated by different methods be evaluated, including the accuracy, completeness, and specificity of gene prediction?

To answer these questions, the authors used 10 different methods to annotate the genomes of 21 species (spanning vertebrates, plants, and insects) and evaluated the performance of these methods with multiple metrics, including BUSCO recovery, CDS length, and false positive rate. The main goal of the study was to determine which methods perform best on a range of sensitivity and specificity metrics, and whether RNA-seq data are necessary for generating high-quality annotations.
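As an illustration of one of the metrics named above, CDS length can be tallied directly from an annotation file. The following is a minimal sketch, not the authors' pipeline: it assumes the annotation is in GFF3 format with `CDS` features that carry a `Parent` attribute, and the function name `cds_lengths` and the example records are hypothetical.

```python
from collections import defaultdict


def cds_lengths(gff3_lines):
    """Sum CDS feature lengths per parent transcript from GFF3 text lines.

    Assumes tab-delimited, nine-column GFF3 records; anything else is skipped.
    """
    totals = defaultdict(int)
    for line in gff3_lines:
        # Skip blank lines, directives, and comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        start, end = int(cols[3]), int(cols[4])
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        parents = attrs.get("Parent")
        if parents is None:
            continue
        # GFF3 coordinates are 1-based and inclusive; a CDS segment
        # may list several comma-separated parents.
        for parent in parents.split(","):
            totals[parent] += end - start + 1
    return dict(totals)


# Hypothetical records, for illustration only.
example = [
    "##gff-version 3",
    "chr1\tsketch\tmRNA\t100\t900\t.\t+\t.\tID=tx1;Parent=gene1",
    "chr1\tsketch\tCDS\t100\t199\t.\t+\t0\tID=cds1;Parent=tx1",
    "chr1\tsketch\tCDS\t300\t450\t.\t+\t2\tID=cds1;Parent=tx1",
    "chr1\tsketch\tCDS\t500\t560\t.\t+\t0\tID=cds2;Parent=tx2",
]
lengths = cds_lengths(example)
```

Per-transcript totals like these could then be compared across annotation methods for the same genome. The other metrics mentioned, such as BUSCO recovery and false positive rate, require external tools and reference data and are not covered by this sketch.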

OMAnnotator: a novel approach to building an annotated consensus genome sequence

User-friendly genome assembly and gene annotation pipelines for vertebrates

Welcome to the big leaves: Best practices for improving genome annotation in non-model plant genomes

Characterizing and Annotating the Genome Using RNA-seq Data

High-throughput sequencing data and the impact of plant gene annotation quality

Integrating gene annotation with orthology inference at scale

Combining independent de novo assemblies optimizes the coding transcriptome for nonconventional model eukaryotic organisms

Combining DNA and protein alignments to improve genome annotation with LiftOn

Modern tools for annotation of small genomes of non-model eukaryotes

The Utilization of Reference-Guided Assembly and In Silico Libraries Improves the Draft Genome of Clarias batrachus and Culter alburnus

The changing face of genome assemblies: Guidance on achieving high‐quality reference genomes

Complement Genome Annotation Lift over Using a Weighted Sequence Alignment Strategy

Significant association of HLA-DQ5 with autoimmune hepatitis in Taiwan

FLAG: Find, Label Annotate Genomes, a fully automated tool for genome gene structural and functional annotation of highly fragmented non-model species

From tradition to innovation: conventional and deep learning frameworks in genome annotation

Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species

Benchmarking multi-platform sequencing technologies for human genome assembly

BAUM: A DNA Assembler by Adaptive Unique Mapping and Local Overlap-Layout-Consensus

Illuminating the dark side of the human transcriptome with long read transcript sequencing

Systematic assessment of long-read RNA-seq methods for transcript identification and quantification