User-friendly genome assembly and gene annotation pipelines for vertebrates

Siwen Wu, Sisi Yuan, Zhengchang Su
DOI: https://doi.org/10.1101/2024.05.21.595213
2024-05-23
Abstract: High-quality genome assembly and accurate gene annotation are critical to understanding the biology and evolutionary history of a species. With advances in sequencing technologies, assembling large vertebrate genomes using a combination of low-cost short reads and noisy long reads is now within reach of an individual lab. Although many tools have been developed, no single tool can produce a chromosome-level assembly. Instead, a pipeline involving multiple steps and different tools is needed to achieve this goal. However, it is difficult for a newcomer to choose an appropriate tool for each stage of the pipeline and to combine the tools for optimal performance. Moreover, most existing gene annotation tools cannot identify both protein-coding genes and pseudogenes comprehensively and accurately. Although gene annotation pipelines such as those of NCBI and ENSEMBL have been described, they are not accessible to individual labs because of their complexity and the computational resources they require. To overcome these obstacles, we introduce user-friendly, optimized genome assembly and gene annotation pipelines for vertebrates. Our genome assembly pipeline produces high-quality chromosome-level assemblies using a combination of PacBio/Nanopore long reads, Illumina short reads and Hi-C short reads. Our gene annotation pipeline accurately and comprehensively annotates both protein-coding genes and pseudogenes in an assembled genome using a combination of homology-based and RNA-seq data-based methods. In applications, both pipelines outperform existing ones.
Bioinformatics