BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs

Felipe A. Simão, Robert M. Waterhouse, Panagiotis Ioannidis, Evgenia V. Kriventseva, Evgeny M. Zdobnov
DOI: https://doi.org/10.1093/bioinformatics/btv351
IF: 5.8
2015-06-09
Bioinformatics
Abstract: MOTIVATION: Genomics has revolutionized biological research, but quality assessment of the resulting assembled sequences is complicated and remains mostly limited to technical measures like N50. RESULTS: We propose a measure for quantitative assessment of genome assembly and annotation completeness based on evolutionarily informed expectations of gene content. We implemented the assessment procedure in open-source software, with sets of Benchmarking Universal Single-Copy Orthologs, named BUSCO. AVAILABILITY AND IMPLEMENTATION: Software implemented in Python and datasets available for download from http://busco.ezlab.org. CONTACT: evgeny.zdobnov@unige.ch SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
biochemical research methods, biotechnology & applied microbiology, mathematical & computational biology
What problem does this paper attempt to address?
This paper addresses the problem of assessing the completeness of genome assemblies and annotations. Existing quality assessment relies mainly on technical measures such as N50. These measures reflect the contiguity of the assembled sequences, but they do not indicate how complete the genome is in terms of expected gene content. The authors therefore propose a quantitative measure of completeness based on single-copy orthologs (SCOs). The approach draws on evolutionarily informed expectations: certain genes are expected to be present as single copies in the species of a given lineage. By checking whether these genes are present, and whether they are complete, the quality of genomic data can be assessed more informatively than with contiguity measures alone. To this end, the authors developed an open-source tool named BUSCO (Benchmarking Universal Single-Copy Orthologs) and constructed multiple SCO datasets for different lineages. These datasets can be used to assess not only genome assemblies but also annotated gene sets and transcriptomes. BUSCO thus provides an intuitive, high-resolution quantification of the completeness of rapidly accumulating genomic data.
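The idea of scoring completeness from expected single-copy orthologs can be illustrated with a small Python sketch. It is not the BUSCO implementation: it assumes each expected ortholog has already been classified by some upstream search step, and the category labels, the `summarize` function, and the example data are hypothetical names chosen for illustration only.

```python
from collections import Counter

# Assumed category labels for this sketch (not taken from the BUSCO code base).
CATEGORIES = ("complete_single", "complete_duplicated", "fragmented", "missing")


def summarize(statuses):
    """Return percentage of expected orthologs in each category.

    `statuses` maps an ortholog identifier to one of CATEGORIES.
    """
    if not statuses:
        raise ValueError("no orthologs to summarize")
    counts = Counter(statuses.values())
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    n = len(statuses)
    summary = {cat: 100.0 * counts.get(cat, 0) / n for cat in CATEGORIES}
    # Overall completeness counts both single-copy and duplicated hits.
    summary["complete"] = summary["complete_single"] + summary["complete_duplicated"]
    summary["n"] = n
    return summary


# Made-up classifications for ten hypothetical ortholog groups.
example = {f"og{i}": "complete_single" for i in range(6)}
example.update({"og6": "complete_duplicated", "og7": "fragmented",
                "og8": "fragmented", "og9": "missing"})

result = summarize(example)
print(result)
```

In this toy input, 7 of the 10 expected orthologs are found complete, so the completeness score is 70%. A contiguity measure such as N50 would not reveal the missing or fragmented genes, which is the gap the paper sets out to fill.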