Author Response: Highly Contiguous Assemblies of 101 Drosophilid Genomes
Bernard Kim,Jeremy Wang,Danny E. Miller,Olga Barmina,Emily Delaney,Ammon Thompson,Aaron A. Comeault,David Peede,Emmanuel R. R. D’Agostino,Julianne Peláez,Jessica M Aguilar,Diler Haji,Teruyuki Matsunaga,Ellie E. Armstrong,Molly Zych,Yoshitaka Ogawa,Marina Stamenković‐Radak,Mihailo Jelić,Marija Savić Veselinović,Marija Tanasković,Pavle Erić,Jian‐Jun Gao,Takehiro K. Katoh,Masanori J. Toda,Hideaki Watabe,Masayoshi Watada,Jeremy S Davis,Leonie C. Moyle,Giulia Manoli,Enrico Bertolini,Vladimı́r Košťál,R. Scott Hawley,Aya Takahashi,Corbin D. Jones,Daniel O. Price,Noah K. Whiteman,Artyom Kopp,Daniel R. Matute,Dmitri A. Petrov
DOI: https://doi.org/10.7554/elife.66405.sa2
2021-01-01
Abstract

Over 100 years of studies in Drosophila melanogaster and related species in the genus Drosophila have facilitated key discoveries in genetics, genomics, and evolution. While high-quality genome assemblies exist for several species in this group, they only encompass a small fraction of the genus. Recent advances in long-read sequencing allow high-quality genome assemblies for tens or even hundreds of species to be efficiently generated. Here, we utilize Oxford Nanopore sequencing to build an open community resource of genome assemblies for 101 lines of 93 drosophilid species encompassing 14 species groups and 35 sub-groups. The genomes are highly contiguous and complete, with an average contig N50 of 10.5 Mb and greater than 97% BUSCO completeness in 97/101 assemblies. We show that Nanopore-based assemblies are highly accurate in coding regions, particularly with respect to coding insertions and deletions. These assemblies, along with a detailed laboratory protocol and assembly pipelines, are released as a public resource and will serve as a starting point for addressing broad questions of genetics, ecology, and evolution at the scale of hundreds of species.

Introduction

The rise of long-read sequencing alongside the continuously decreasing costs of next-generation sequencing have served to greatly democratize the process of genome assembly, making it feasible to assemble high-quality genomes at a previously unthinkable scale. Currently, a number of large consortia are leading well-publicized efforts to assemble the genomes of many taxa throughout the Tree of Life. 
Some often overlapping examples include the Vertebrate Genomes Project (Rhie et al., 2021), the Bird 10,000 Genomes Project (Feng et al., 2020), the Zoonomia Project (Zoonomia Consortium et al., 2020), the Darwin Tree of Life (Threlfall and Blaxter, 2021), the Earth Biogenome Project (Lewin et al., 2018), and the 5000 Arthropod Genomes Initiative (Robinson et al., 2011a). In addition to establishing new standards for modern large-scale genomics projects and opening avenues for genomic research that were previously only feasible in model organisms across a multitude of species, these projects are creating an opportunity to study genetic variation and address fundamental biological questions at a scope that was simply not possible before. In many respects, the foundation for modern genomics was built by those studying the vinegar (also called fruit or pomace) fly Drosophila melanogaster and related species in the family Drosophilidae. As a premier model organism for genetic and biological research since the foundational work of Morgan and colleagues, D. melanogaster was, after C. elegans, the second metazoan organism to undergo whole-genome sequencing (Adams et al., 2000). At that time, the completion of the D. melanogaster genome proved the viability of shotgun sequencing approaches and paved the way for larger, more complicated genomes (Hales et al., 2015). The genomic tractability that made drosophilids attractive for this work has led to their continued widespread use as model organisms in the genomic era: the whole-genome sequencing of 12 Drosophila species (Clark et al., 2007) and the characterization of functional elements in Drosophila genomes (Roy et al., 2010) are prominent milestones in the history of modern genomics. As it is a popular model system, an extensive collection of genomic resources exists for drosophilids today. 
Excluding genomes from this study, there are representative genome assemblies available on NCBI databases (GenBank and RefSeq) for about 75 different drosophilid species (Hotaling et al., 2021). About a third of these genomes are provided as chromosome-level scaffolds. Along with this diverse catalog of whole-genome sequences are collections of expression and regulation data (Chen et al., 2014; Roy et al., 2010), maps of constrained (i.e. functional) sequences inferred with comparative genomics tools (Stark et al., 2007), and population genomic data (e.g. Guirao-Rico and González, 2019; Lack et al., 2016; Signor et al., 2018). The well-studied D. melanogaster was among the first species to have high-quality genomes assembled for multiple individuals, revealing population variation in structural variants (Chakraborty et al., 2019; Long et al., 2018). Yet even with the intense scientific interest and effort thus far, only a small portion of the remarkably diverse drosophilids, a family which includes over 1600 described and possibly thousands of other undescribed species (O'Grady and DeSalle, 2018), is available for genomic study today. There is much scientific potential to be unlocked by improving the catalog of genomic diversity within this group, and the simplification that long reads bring to the genome assembly process is key. Long reads have proved to be a way to quickly generate affordable yet high-quality genomes; in fact, the cost of a highly contiguous and complete Drosophila assembly based on long-read sequencing was recently estimated to be about 1,000 US dollars (Miller et al., 2018; Solares et al., 2018), orders of magnitude less than the first D. melanogaster genome. 
While a number of studies have already used long reads to assemble the genomes of one or a few drosophilid species (Bracewell et al., 2019; Chakraborty et al., 2021; Comeault et al., 2020; Flynn et al., 2020; Hill et al., 2020; Mai et al., 2020; Miller et al., 2018; Paris et al., 2020; Rezvykh et al., 2021; Solares et al., 2018), a sequencing and genome assembly project at a scale similar to that of the large genome assembly consortia, especially without similar resources and funding, remains challenging even with the benefits of long reads. Yet, there continue to be rapid improvements to long-read sequencing that may alleviate some of these logistical challenges. Long-read sequencing costs have dropped significantly in the past few years as protocols, kits, and the underlying technology improve. Ultra-long (50–100 kb or longer) reads are obtainable with Oxford Nanopore (ONT) sequencing and under the right conditions should allow entire chromosomes to be fully assembled without additional time-consuming and costly scaffolding methods (e.g. Nurk et al., 2021). By simplifying the genome assembly process and reducing the cost of genome assembly even further, these techniques finally make it possible to assemble tens or hundreds of drosophilid genomes at a time. Here, we present another step toward a comprehensive drosophilid genome dataset: a community resource of 101 de novo genome assemblies from 93 drosophilid species. These genomes were assembled using lines contributed by Drosophila researchers from across the world, and represent a diversity of ecologies and geographical distributions. We improve upon the Nanopore-based hybrid assembly (Nanopore plus Illumina) approach for Drosophila lines (Miller et al., 2018) to substantially increase the sequencing throughput contained in ultra-long reads while reducing overall costs. The contiguity, completeness, and quality of these genomes are assessed. 
We show that under ideal conditions, about two Drosophila lines (assuming an average 180 Mb genome) can be sequenced to at least 30× depth of coverage per ONT R9.4.1 (rev D) flow cell, at an approximate cost of 350 US dollars per line. Along with this manuscript and data, we provide a detailed Nanopore sequencing laboratory protocol specifically optimized for Drosophila lines, along with containerized computational pipelines. These genome assemblies and technical resources should facilitate the process of conducting large-scale genome projects in this key model clade and beyond.

Results and discussion

Taxon sampling

Our selection of species and strains for sequencing (Table 1) improves the geographic, ecological, and phylogenetic diversity of genomic data from the family Drosophilidae. Most (99 of 101) of the genome assemblies presented here are from 14 species groups in subgenera Drosophila and Sophophora of the subfamily Drosophilinae (Toda, 2020). One species from each of the genera Leucophenga and Chymomyza, both contained in the less-studied sister subfamily Steganinae, has also been sequenced. We note some taxonomic inconsistencies arising from the paraphyly or polyphyly of certain drosophilid taxa (Finet et al., 2021; O'Grady and DeSalle, 2018; Yassin, 2013) but will make no attempt to address those issues here. The sequenced species originate from mainland and island locations in North America, Europe, Africa, and Asia; are distributed from northern (e.g. D. tristis, D. littoralis) to equatorial (e.g. D. bocqueti) latitudes; represent two independent transitions to leaf-mining herbivory (Scaptomyza and Lordiphosa); and for some species, like the pest Zaprionus indianus, represent reproductively isolated populations taken from throughout the range. For difficult-to-culture species, for instance Leucophenga varia and some Lordiphosa spp., only wild-caught flies were sequenced. Finally, we have sequenced lines in active research use. 
Additional genomic resources like gene expression or population data should be expected in the near future to accompany many of these assemblies. For species where multiple lines were assembled, we have selected a recommended line to use based on genome quality and denote this recommendation in Table 1. Table 1 Species and strain information for all samples assembled for this work. Note: Species group and subgroup information is taken from the NCBI Taxonomy Browser with slight modifications following O'Grady and DeSalle, 2018. Strain names along with corresponding NDSSC and Kyoto DGRC stock center numbers are provided to the best of our knowledge. See Supplementary file 1 and Supplementary file 6 for detailed information on samples and data. When multiple lines of a species are listed, * denotes the preferred assembly. SubgenusGroupSubgroupSpeciesSexStrain nameNDSSCKyoto DGRC/ EhimeAdditional notesSophophoramelanogastermelanogasterD. melanogasterMFISO-1 GENOME14021-0231.36NABDGP reference strainD. mauritianaFNA14021-0241.01NAMiller et al., 2018D. simulansFNA14021-0251.006NAMiller et al., 2018D. sechelliaFNA14021-0248.01NAMiller et al., 2018D. teissieri *M273.3NANAD. teissieriMCT02NANAD. yakubaFNA14021-0261.01NAMiller et al., 2018D. erectaFNA14021-0224.01NAMiller et al., 2018eugracilisD. eugracilisFNA14026-0451.02NAMiller et al., 2018suzukiiD. subpulchrellaML1NANAD. biarmipesMF361.0 iso1 l-11 GENOME strain 114023-0361.10NAmodENCODE straintakahashiiD. takahashiiFIR98-3 E-12201NAE-912201inbred derivative of Ehime stock IR98-3ficusphilaD. ficusphilaF631.0-iso1 l-10 GENOME14025-0441.05NAmodENCODE strainrhopaloaD. carrolliMFKB866NANAD. rhopaloaMFBaVi067 GENOME14029-0021.01E-24701modENCODE strainD. kurseongensisFSaPa58NANAD. fuyamaiFKB-121714029-0011.01NAelegansD. elegansFHK0461.03 GENOME14027-0461.03NAmodENCODE strainsuzukiiD. oshimaiMMT-04NANAmontiumD. bocquetiMYAK3_mont-66NANAD. sp aff chauvacaeMmont_up-71NANAD. jambulinaMFst-214028-0671.01NAD. 
kikkawaiF561.0-iso4 l-10 GENOME14028-0561.14NAmodENCODE strainD. rufaFEH091 iso-C L_3NA914802inbred derivative of Ehime stock EH091D. triaurariaFNA14028-0691.9NAMiller et al., 2018; previously mis-identified as D. kikkawaiananassaeD. malerkotliana pallensFpalQ-isoGNANAD. malerkotliana malerkotlianaMFmal0-isoC14024-0391.00NAinbred derivative of strain 14024-0391.00D. bipectinataMF4-4-2-3-1-1-1-1-1 BackUp14024-0381.04NAInbred derivative of NDSSC strainD. parabipectinataMFpar2-isoB14024-0401.02NAinbred derivative of strain 14024-0401.02 (now extinct)D. pseudoananassae pseudoananassaeFWau 125NANAD. pseudoananassae nigrensFVT04-31NANAD. ananassaeF14024-0371.13NANAMiller et al., 2018D. variansMFCKM15-L1NANAD. ercepeaceMF164-1414024-0432.00NAobscuraobscuraD. ambiguaMR42NANAisofemale strain from the wildD. tristisMD2NANAisofemale strain from the wildD. obscuraMBZ-5NANAisofemale strain from the wildD. subobscuraMKüsnachtNANAstandard laboratory strainpseudoobscuraD. persimilisFNA14011-0111.01NAMiller et al., 2018D. pseudoobscuraFNA14011-0121.94NAMiller et al., 2018willistoniwillistoniD. willistoni (Uruguay) *ML-G314030-0811.17NAD. willistoniFNA14030-0811.00NAMiller et al., 2018D. paulistorum L06 *M(Heed) H66.1C14030-0771.06NAD. paulistorum L12ML1214030-0771.12NAD. tropicalisM(Heed) H65.214030-0801.00NAD. insularisMjp01iNANAisofemale line from J. PowellbocainensisD. sucineaM49.1514030-0791.01NAD. nebulosaMH176.1014030-0761.01NAsaltanssaltansD. saltansM(Heed) H180.4014045-0911.00NAD. prosaltansM(Heed) H29.614045-0901.02NAneocordataD. neocordataM2536.714041-0831.00NAsturtevantiD. sturtevantiFH191.2314043-0871.01NALordiphosamikiL. clarofinisMFGuizhou062018LCNANALine inbred for 2 generations in the lab before sequencingL. stackelbergiMFUCILTSSapporo052019LSNANAPool of 50 wild-caught fliesL. magnipectinataMFUCKTSapporo052019LMNANAPool of 50 wild-caught fliesfenestrarumL. collinellaMFUCKTSapporo052019LCNANAPool of 30 wild-caught fliesL. 
mommaiMFMMSapporo052014LMNANADrosophilaZaprionusvittigerZ. nigranusMst01nNANAline derived from wild collectionZ. camerounensisMjd01camNANAisofemale line from J. DavidZ. lachaiseiMjd01lNANAline derived from wild collectionZ. vittigerMjd01vNANAisofemale line from J. DavidZ. davidiMjd01dNANAisofemale line from J. DavidZ. taronusMst01tNANAline derived from wild collectionZ. capensisMjd01capNANAisofemale line from J. DavidZ. gabonicusMjd01gabNANAisofemale line from J. DavidZ. indianus RCR04MRCR04NANAZ. indianus 16GNV01M16GNV01NANAZ. indianus BS02 *MBS02NANAZ. indianus CDD18MCDD18NANAZ. africanusMBS06NANAZ ornatusMjd01oNANAisofemale line from J. DavidtuberculatusZ. tsacasiMcar7-4NANAZ. tsacasi *Mjd01tNANAisofemale line from J. DavidinermisZ. kolodkinaeMjd01kNANAisofemale line from J. DavidZ. inermisM18BSZ10NANAZ. ghesquiereiMjd01gheNANAisofemale line from J. DavidcardinidunniD. dunniMH254.2115182-2291.00NAD. arawakanaMMONHI050227(B)-10415182-2261.03NAcardiniD. cardiniMNA15181-2181.03917701funebrisfunebris?undescribed (Sao Tome mushroom)Mst01mNANAundescribed species collected on mushroom, Sao TomefunebrisD. funebrisMfst01NANAline derived from wild collectionimmigransimmigransD. immigrans *FFK05-1915111.1731.12NAD. immigrans kari17Mkari17NANA(incertae sedis)D. pruinosaMiso-A1 l-9NANAquadrilineataD. quadrilineataMquad-TMUNA914402tumiditarsusD. repletoidesMISZ-isoB I-10NANAScaptomyzaScaptomyzaS. montanaMFiso-CA-L1NANAS. graminumFTMU-2019NANA30 wild-caught femalesParascaptomyzaS. pallidaMFiso-CA-L1NANAHemiscaptomyzaS. hsuiMFiso-CA-L1NANAHawaiian DrosophilaorphnopezaD. sproatiMFDKPTOMS02NANAPool of wild-caught fliesD. murphyiMFDKPHETFM01NANAFlies from recently established but not inbred lab linegrimshawiD. grimshawiFNA15287-2541.00NASame line as caf1 genomevirilisvirilisD. virilisFNA15010-1051.87NAMiller et al., 2018D. americanaM3367.115010-0951.00NAAlso called Anderson strainD. littoralisMKilpisjärvi 1NANAOriginally misidentified as D. 
ezoana (Lankinen 1986, J Comp Physiol A 159: 123-142)repletarepletaD. repletaMkari30NANAmulleriD. mojavensisF15081-1352.22NANAMiller et al., 2018genus: LeucophengaL. variaMnc01vNANASequenced single wild-caught fly, no amplificationgenus: ChymomyzaC. costataMSapporoNANA * denotes the genome of best quality when multiple assemblies are available for a species.

Near chromosome-scale assembly with ultra-long reads

We sequenced the fly samples using an ONT 1D ligation kit approach, replacing magnetic bead cleanups with size-selective precipitation. This modified workflow is optimized for genomic DNA extractions from 15 to 30 whole flies, increases the yield of ultra-long reads relative to the standard ligation kit protocol, increases overall sequencing throughput, and significantly reduces the cost of library preparation. Sequencing runs varied with sample quality and type, and in general read lengths and throughput increased over the course of this work with improved iterations of the protocol. Under optimal conditions and with enough starting material (at least 2,000 ng of very high molecular weight DNA) to prepare at least three library loads (~1,200–1,500 ng total prepared library, 350–500 ng per load), along with regular DNase flushes to maintain yields, Nanopore sequencing runs following the supplied protocol should net 12–15 Gb of data per R9.4.1 flow cell with a read N50 greater than 20 kb, and about 30% of data in reads longer than 50 kb. We generated paired-end, 150 bp Illumina reads for most strains unless public datasets were available. Deep (average 52×) sequencing coverage with a substantial fraction of ultra-long reads (Supplementary file 1) resulted in high-quality genome assemblies that were comparable to and often better than currently available reference genomes in terms of contiguity and completeness (Figure 1, Figure 1—figure supplement 1, Supplementary file 2). 
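As a rough check on the throughput figures above, the expected depth of coverage is the sequencing yield divided by the genome size. The sketch below is illustrative only: the function name `depth_of_coverage` and the even split of one flow cell between two lines are assumptions made here, and the 180 Mb genome size is the average assumed earlier in the text.

```python
def depth_of_coverage(yield_bp: float, genome_size_bp: float, n_lines: int = 1) -> float:
    """Mean depth per line when one flow cell's yield is split evenly among n_lines.

    Hypothetical helper for illustration; it ignores read filtering and
    unequal pooling, both of which would lower the realized depth.
    """
    if genome_size_bp <= 0 or n_lines <= 0:
        raise ValueError("genome_size_bp and n_lines must be positive")
    return yield_bp / (n_lines * genome_size_bp)


GENOME_SIZE_BP = 180e6  # average genome size assumed in the text

# Yield range quoted for one R9.4.1 flow cell under optimal conditions.
for yield_gb in (12, 15):
    depth = depth_of_coverage(yield_gb * 1e9, GENOME_SIZE_BP, n_lines=2)
    print(f"{yield_gb} Gb per flow cell, two lines: {depth:.1f}x per line")
```

Under these assumptions, a 12–15 Gb run corresponds to roughly 33–42× per line, which is consistent with the stated target of at least 30× for two lines per flow cell.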
We chose Flye (Kolmogorov et al., 2019) as our assembler based on superior contiguity and favorable runtimes relative to Miniasm (Li, 2016) and Canu (Koren et al., 2017; Figure 1—figure supplement 2). To provide standardization for measures of contiguity, we estimated genome size for each assembly using long-read coverage over single-copy BUSCO loci (Supplementary file 2).

Figure 1. Nanopore-based assemblies are highly contiguous and complete. (A,B) Assembly contiguity is compared to the D. melanogaster v6.22 reference genome (blue) as well as five recently published, highly contiguous Illumina assemblies (red lines, D. birchii, D. bocki, D. bunnanda, D. kanapiae, D. truncata; Bronski et al., 2020). (A) Nx curves, or the (y-axis) size of each contig when contigs are sorted in descending size order, in relation to the (x-axis) cumulative proportion of the genome assembly that is covered. (B) The distribution of contig N50, the size of the contig at which 50% of the assembly is covered. (C) Assembly completeness assessed by BUSCO v4.0.6 (Seppey et al., 2019). Note, D. equinoxialis was evaluated with BUSCO v4.1.4 due to an issue with v4.0.6. L. stackelbergi has >10% missing BUSCOs. Individual assembly summary statistics are provided in Supplementary file 2.

Of 101 total assemblies, 94 contain over 98% of the assembly in contigs larger than 10 kb, and both contig N50s and NG50s exceed 1 Mb for these genomes (Figure 1A, Figure 1B, Figure 1—figure supplement 3, Supplementary file 2). Assembly sizes were highly correlated with estimated genome sizes (Figure 1—figure supplement 4). In addition to meeting the megabase contig N50 standard for new genomes proposed by the Vertebrate Genomes Project (Rhie et al., 2021), these statistics show that most of the genome is present in the assembly in megabase-sized contigs. In other words, the assemblies are nearly at the chromosome level. 
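The contiguity statistics used here can be computed directly from a list of contig lengths. The following is a minimal sketch and not the pipeline used for these assemblies; the function name `nx` and the toy contig lengths are hypothetical. N50 takes the total assembly length as its denominator, whereas NG50 takes an estimated genome size, which is why a genome size estimate is needed to standardize the comparison.

```python
from typing import Iterable, Optional


def nx(contig_lengths: Iterable[int], x: float = 50.0,
       genome_size: Optional[int] = None) -> Optional[int]:
    """Return the Nx (or NGx when genome_size is given) contig length.

    Contigs are sorted from longest to shortest; the result is the length of
    the contig at which the running total first reaches x% of the assembly
    length (Nx) or of the estimated genome size (NGx).
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = genome_size if genome_size is not None else sum(lengths)
    target = total * x / 100.0
    cumulative = 0
    for length in lengths:
        cumulative += length
        if cumulative >= target:
            return length
    return None  # the contigs never reach x% of the chosen denominator


toy_contigs = [40, 30, 20, 10]             # hypothetical lengths, arbitrary units
print(nx(toy_contigs))                     # N50 of the toy assembly
print(nx(toy_contigs, genome_size=160))    # NG50 against a larger genome estimate
```

For the toy input, N50 is 30 and NG50 against a genome estimate of 160 is 20, showing how NG50 falls below N50 when the assembly is smaller than the estimated genome.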
For comparison, of the 76 representative drosophilid genomes that were previously available on NCBI (Hotaling et al., 2021), only 25 have an N50 greater than 1 Mb (Figure 1—figure supplement 1). Moreover, many of these highly contiguous NCBI genomes are scaffolded, an additional step that would have added a significant amount of time and additional expenses to this study. Even when DNA was extracted from pools of wild-caught flies or a single fly (Leucophenga varia) resulting in sub-optimal read lengths and output, the assembly was comparable to existing short read assemblies (Figure 1A, Figure 1B). High contiguity resulted in benchmarking universal single-copy ortholog (BUSCO) completeness (Seppey et al., 2019; Simão et al., 2015) in the range of 97–99+% for all but the three most fragmented genomes (Figure 1C). As with contiguity, the completeness of these genomes is comparable to reference genomes on NCBI (Figure 1—figure supplement 1).

Estimates of sample diversity

We have utilized a variety of fly samples, from highly inbred lab lines to wild-caught flies, for genome assembly. We therefore sought to quantify the level of diversity inherent to each sample and use variant calls to estimate the error rate for each assembly. Long and short reads (if available) were mapped separately to each finished genome and variant calling was performed with PEPPER-Margin-DeepVariant (Shafin et al., 2021) for long reads and BCFtools (Danecek et al., 2021; Li, 2011) for short reads. After quality filtering and masking genomic regions annotated as repeats, the counts of single nucleotide polymorphisms (SNPs), indels, and the fraction of sites with a non-reference SNP were computed (Figure 2, Supplementary file 3). Note, when short reads were not from the same strain as used for the assembly, short read polishing was used to only correct indels, and called SNPs will not accurately represent the variation in the sample that was sequenced with Nanopore. 
Also note that SNP calls from Nanopore data should be relatively accurate but indel calls will not be (Shafin et al., 2021).

Figure 2. Estimated heterozygosity in the data used for genome assembly. Per-site SNP heterozygosity (number of heterozygous SNPs/number of callable sites) is plotted for each of the 101 assembled lines. Blue dots represent heterozygosity estimates from Nanopore reads with PEPPER-Margin-DeepVariant (Shafin et al., 2021). Orange dots represent heterozygosity estimates from short reads with BCFtools (Li, 2011). The genomes on the right are for species that did not have available short-read data. Numerical values for these estimates are provided in Supplementary file 4.

Large variation in sample diversity over several orders of magnitude was observed. Estimated SNP heterozygosity, the number of heterozygous SNPs divided by the number of callable sites, ranged from 0.00035% to 1.1% from long reads and 0.0015% to 2.1% from short reads, and heterozygosity estimated from long reads was systematically lower than that from short reads, particularly when sample diversity was high (Figure 2, Figure S6). Qualitative patterns of heterozygosity generally tracked the history of the samples (e.g. the highly inbred reference strains had very low diversity). Conditioning on datasets where both long and short reads were generated from the same sample, heterozygosity estimates from both types of reads were positively correlated (Pearson correlation R²=0.50, p=1.13×10⁻¹²). If we ignore Lordiphosa, the group with wild-caught or recently collected samples that was consequently the most challenging to assemble, this correlation is greatly increased (Pearson correlation R²=0.81, p<2.2×10⁻¹⁶). Interestingly, we did not observe a significant relationship (p=0.30) between estimated heterozygosity and assembly contiguity (Figure 2—figure supplement 1). 
The number of heterozygous non-reference variants almost always exceeded the number of homozygous variants (Supplementary file 3), as would be expected from residual diversity in the sequenced lines.

Estimates of sequence quality

Next, we estimated the genome-wide error rates in our assemblies using both the variant calls obtained previously and a reference-free method (Supplementary file 4). For the first approach, Phred-scaled (Ewing et al., 1998) consensus quality (QV) was estimated by assuming all sites with a non-reference variant were errors. The error rate was then computed by dividing the number of sites with at least one non-reference variant by the total number of callable bases. As expected from the patterns of heterozygosity estimated from long and short reads, there was a large amount of variability in quality scores. Estimates from short reads ranged from QV17 to QV45, and those from long reads were slightly higher, from QV19 to QV52 (Supplementary file 4). This method is likely to be biased by assembly features that affect the quality of read mapping; for example, we remove sequences annotated as repeats when filtering the variant calls. To address this bias, we employed the reference-free approach implemented in Merqury (Rhie et al., 2020) for the 94 assemblies which had some kind of short-read data available (Figure 3A, Supplementary file 4). Estimated quality scores ranged from QV16 to QV40, and once again, samples for which reads from a different strain or a genetically diverse sample (i.e. wild samples or recent isolates) were used had the lowest estimated QV. Merqury-estimated QV was on average higher than consensus quality estimated by the variant calling methods, but the relative ranking of QV estimates remained largely consistent with QV based on short-read (Spearman’s ρ=0.642, p<2.2×10⁻¹⁶) and long-read (Spearman’s ρ=0.684, p<2.2×10⁻¹⁶) variant calls. 
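The variant-based consensus quality described above is a Phred transformation of the per-base error rate, QV = −10·log10(E), where E is the number of sites with at least one non-reference variant divided by the number of callable bases. A minimal sketch follows; the function name `phred_qv`, the cap returned when no errors are observed, and the example counts are assumptions for illustration and are not values from the supplementary files.

```python
import math


def phred_qv(n_error_sites: int, n_callable: int, cap: float = 60.0) -> float:
    """Phred-scaled consensus quality from error-site and callable-base counts.

    E = n_error_sites / n_callable and QV = -10 * log10(E). A cap is applied
    so that zero observed errors does not yield an infinite score; the cap
    value is an assumption of this sketch.
    """
    if n_callable <= 0:
        raise ValueError("n_callable must be positive")
    if not 0 <= n_error_sites <= n_callable:
        raise ValueError("n_error_sites must lie between 0 and n_callable")
    if n_error_sites == 0:
        return cap
    return min(cap, -10.0 * math.log10(n_error_sites / n_callable))


# Hypothetical counts: one error site per 10,000 and per 1,000 callable bases.
print(round(phred_qv(10_000, 100_000_000), 2))
print(round(phred_qv(100_000, 100_000_000), 2))
```

One error per 10,000 callable bases corresponds to QV40 and one per 1,000 to QV30, so each 10-point change in QV reflects a tenfold change in the estimated error rate.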
Figure 3. Nanopore-based Drosophila assemblies are accurate, particularly in coding regions. (A) Genome-wide, Phred quality scores estimated with the reference-free, k-mer based approach implemented in Merqury (Rhie et al., 2020). Merqury requires a short-read dataset to perform the evaluation. Filled circles represent QV estimates with short-read data from the same strain used for Nanopore sequencing, and empty circles denote estimates using short-read data from a different strain than used for Nanopore sequencing. (B, C, D) Phred quality score cutoffs for the bottom 10th percentile of 100 kb genomic windows, as evaluated with a reference-based approach, in coding sequences only. Quality scores are capped at 60 for visualization purposes. At least 90% of 100 kb windows are this accurate. Only Nanopore assemblies with an NCBI RefSeq genome counterpart of the same strain were evaluated. Accuracy is shown for SNVs (B), insertions (C), and deletions (D) separately. Additional details on quality score estimates are provided in Figure 3—figure supplement 1 and Supplementary file 4.

While these estimates showed our genomes to mostly fall below the often-recommended QV40 threshold for reference genomes (Koren et al., 2019; Rhie et al., 2021), there are many reasons to expect that sequence quality in certain regions of the genome will be far better than the average. As expected, we found that QV estimates were particularly low when short-read data from a different sample was used for the estimation, as any true variation between strains will inflate the error rate. Because we sequenced pools of flies, residual polymorphism will be found in the data even when long and short reads are sampled from the same pool of flies. In these cases QV might be considered as a lower bound estimate of the true accuracy of the assembly. 
Additionally, complex coding sequences are likely to be far more accurate than other regions of the genome, like repeats, due to better short-read mapping. The single genome-wide estimates of QV we report obscure this variation.

Nanopore-based assemblies are highly accurate in coding regions

For these reasons, we found it critical to further examine how errors are distributed in Nanopore assemblies. Of particular concern is the accuracy of coding sequences. Gene annotation is an important and obvious next step after assembling a new genome, but Nanopore sequences are known to systematically contain indels in homopolymer runs that cannot be called accurately when a run exceeds the size of the nanopore reader head. Indel disruptions to otherwise highly accurate coding sequences would have a disproportionately large negative impact on protein prediction (Watson and Warr, 2019). On the other hand, it is likely that coding sequences are generally more accurate than the rest of the genome since short-read mapping is generally more reliable there. In theory, most exons should be free of errors at a genome-wide quality somewhere between QV30 and QV40 (Koren et al., 2019), but many of our assemblies do not appear to reach this benchmark. Reference-based quality assessments were used to better understand how error rates vary across different genomic elements. We downloaded the 8 NCBI RefSeq genome assemblies for which we had a Nanopore genome of the same species and strain: D. biarmipes, D. elegans, D. ficusphila, D. grimshawi, D. kikkawai, D. melanogaster, D. mojavensis, and D. rhopaloa. Using the ONT Pomoxis software, we aligned each Nanopore assembly to its corresponding reference genome and estimated QV in non-overlapping 100 kb windows, using the entire sequence, then only coding sequences, introns, intergenic regions, and repeats, using gene and repeat definitions provided through NCBI RefSeq. 
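The windowed estimate just described can be sketched as follows. This is not how Pomoxis computes it internally; it is a simplified illustration in which the hypothetical function `windowed_qv` takes 0-based positions of assembly-versus-reference differences on a single sequence, counts each distinct position as one error, and caps scores at 60 as in the Figure 3 caption.

```python
import math
from statistics import median
from typing import Iterable, List


def windowed_qv(error_positions: Iterable[int], seq_length: int,
                window: int = 100_000, cap: float = 60.0) -> List[float]:
    """Phred QV in non-overlapping windows along one sequence.

    Each distinct position in error_positions counts as a single error.
    The final window may be shorter than `window`; its own length is used
    as the denominator.
    """
    if seq_length <= 0 or window <= 0:
        raise ValueError("seq_length and window must be positive")
    n_windows = math.ceil(seq_length / window)
    counts = [0] * n_windows
    for pos in set(error_positions):
        if not 0 <= pos < seq_length:
            raise ValueError(f"position {pos} is outside the sequence")
        counts[pos // window] += 1
    qvs = []
    for i, n_err in enumerate(counts):
        size = min(window, seq_length - i * window)
        if n_err == 0:
            qvs.append(cap)  # no differences observed in this window
        else:
            qvs.append(min(cap, -10.0 * math.log10(n_err / size)))
    return qvs


# Toy example: 100 differences clustered at the start of a 300 kb sequence.
qvs = windowed_qv(range(100), seq_length=300_000)
print([round(q, 1) for q in qvs], "median:", median(qvs))
```

In this toy case the first window scores about QV30 while the other two sit at the cap, so the median across windows stays at 60 even though one window carries all of the differences.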
All differences between query and reference assemblies were considered to be errors. As expected, we found that sequence accuracy varied greatly within each genome assembly (Figure 3—figure supplement 1). Mean genome-wide QV ranged from QV15 to QV24 while median QV across the 100 kb windows ranged from QV14 to QV36. When looking only at coding sequences, mean QV ranged from QV23 to QV29, while the median window accuracy, with the exceptions of D. grimshawi (QV25) and D. rhopaloa (QV30), indicated complete identity (>QV50) between assembly and reference. For D. grimshawi and D. rhopaloa, SNVs were the primary contributor to the error rate and the number of indels was similar to the other genomes (median QV(indel)>50). Sequence accuracy was lower when looking at introns, intergenic regions, and repeats, in that order. However, regardless of the genomic element type, median QV across the windows always exceeded mean QV, often by more than QV10, or an order of magnitude difference in the error rate. In other words, differences between Nanopore and reference assemblies were clustered heavily into a few genomic regions, and most coding sequences were very accurate despite the seemingly high mean error rate (F