The Ribosomal Operon Database (ROD): A full-length rDNA operon database derived from genome assemblies

Anders K. Krabberød, Embla Stokke, Ella Thoen, Inger Skrede, Håvard Kauserud
DOI: https://doi.org/10.1101/2024.04.19.590225
2024-04-30
Abstract: Current rDNA reference sequence databases are tailored towards shorter DNA markers, such as parts of the 16/18S marker or the ITS region. However, due to advances in long-read DNA sequencing technologies, longer stretches of the rDNA operon are increasingly used in environmental sequencing studies to increase the phylogenetic resolution. There is, therefore, a growing need for longer rDNA reference sequences. Here, we present the Ribosomal Operon Database (ROD), which includes eukaryotic full-length rDNA operons fished from publicly available genome assemblies. Full-length operons were detected in 34.1% of the 34,701 examined eukaryotic genome assemblies from NCBI. In most cases (53.1%), more than one operon variant was detected, which can be due to intragenomic operon copy variability, allelic variation in non-haploid genomes, or technical errors from the sequencing and assembly process. The highest copy number found was 5,947 in . In total, 453,697 unique operons were detected, with 69,480 operon variant clusters remaining after intragenomic clustering at 99% sequence identity. The operon length varied extensively across eukaryotes, ranging from 4,136 to 16,463 bp, which will lead to considerable PCR bias during amplification of the entire operon. Clustering the full-length operons revealed that the different parts (i.e., 18S, 28S, the hypervariable region V4 of 18S, and ITS) provide divergent taxonomic resolution, with 18S and the V4 region being the most conserved. The Ribosomal Operon Database (ROD) will be updated regularly to provide an increasing number of full-length rDNA operons to the scientific community.
Genomics
What problem does this paper attempt to address?
The paper aims to address the following issues:

1. **Insufficient existing databases**: Current rDNA reference sequence databases mainly target shorter DNA markers (such as parts of the 16S/18S marker or the ITS region). With the development of long-read sequencing technologies, environmental sequencing studies need longer rDNA reference sequences to improve phylogenetic resolution.
2. **Establishing a comprehensive full-length rDNA database**: To meet this need, the authors constructed the Ribosomal Operon Database (ROD), which contains full-length eukaryotic rDNA operon sequences extracted from publicly available genome assemblies.
3. **Exploring the variability and length variation of rDNA operons**: The paper analyzes how rDNA operons vary across eukaryotes, including copy number, operon length, and the degree of sequence variation in the different parts of the operon. It also discusses how these factors affect phylogenetic analysis and DNA metabarcoding studies.

In summary, the paper addresses the shortage of long rDNA reference sequences by constructing ROD, and it describes the diversity of rDNA operons across eukaryotes and what that diversity implies for their use as markers.
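
The percentages quoted in the abstract are easier to compare when turned into approximate counts. The short Python sketch below does that arithmetic using only the figures stated in the abstract. It rests on one assumption that the abstract does not spell out: that the 53.1% refers to the assemblies in which a full-length operon was found, not to all examined assemblies. Because the published percentages are rounded, the derived counts are estimates (roughly 11,800 assemblies with an operon, about 6,300 of them with more than one variant), not values reported by the authors. All variable names are illustrative.

```python
# Figures quoted in the abstract.
assemblies_examined = 34_701
fraction_with_operon = 0.341      # full-length operon detected
fraction_multi_variant = 0.531    # ASSUMPTION: fraction of operon-containing assemblies
unique_operons = 453_697
variant_clusters = 69_480         # after intragenomic clustering at 99% identity
min_len_bp, max_len_bp = 4_136, 16_463

# Approximate number of assemblies in which a full-length operon was detected.
with_operon = round(assemblies_examined * fraction_with_operon)

# Approximate number of those assemblies with more than one operon variant.
multi_variant = round(with_operon * fraction_multi_variant)

# How strongly the 99% clustering step collapses the unique operons.
reduction = unique_operons / variant_clusters

# Ratio between the longest and shortest operon, relevant to the PCR bias remark.
length_ratio = max_len_bp / min_len_bp

print(f"assemblies with an operon:  ~{with_operon}")
print(f"with more than one variant: ~{multi_variant}")
print(f"clustering reduction:       ~{reduction:.2f}-fold")
print(f"longest / shortest operon:  ~{length_ratio:.2f}x")
```

The last two ratios show that clustering at 99% identity collapses the unique operons about 6.5-fold, and that the longest operon is about four times the length of the shortest, which is the spread behind the abstract's remark on PCR bias.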