BioNano Genome Map Resource for Oryza Sativa Ssp. Japonica and Indica and Its Application in Rice Genome Sequence Correction and Gap Filling
Ping Chen,Xinyun Jing,Baosheng Liao,Yan Zhu,Jiang Xu,Renyi Liu,Yinhong Zhao,Xuan Li
DOI: https://doi.org/10.1016/j.molp.2017.02.003
IF: 27.5
2017-01-01
Molecular Plant
Abstract: As one of the most important staple food crops worldwide, rice is among the first plant species whose genomes were sequenced. The reference genome for Oryza sativa ssp. japonica variety Nipponbare was completed by the International Rice Genome Sequencing Project (IRGSP) using a bacterial artificial chromosome (BAC)-based cloning strategy (Matsumoto et al., 2005). IRGSP1.0 was the latest release (Kawahara et al., 2013), the result of iterative updates. O. sativa ssp. indica is estimated to account for over 70% of the world's rice production. A representative variety of indica, 93-11, was sequenced by the Beijing Institute of Genomics using a whole-genome shotgun approach and also underwent updates (Yu et al., 2002; Gao et al., 2013). Assembly errors and DNA structural defects are major issues dogging sequenced genomes today, more so for indica 93-11 (Yu et al., 2006), as issues at the sequence structural level, i.e., insertions, deletions, translocations, and copy number variations, are difficult to resolve with short-read sequencing technology. Unfilled gaps that contain complex sequences (often referred to as dark matter) remained in both rice genomes.
Array-based comparative genomic hybridization (aCGH) (Bruce et al., 2009) and physical mapping with BAC libraries (Oryza Map Alignment Project) (Jacquemin et al., 2013) were among the first technologies used to detect and catalog sequence structural features for O. sativa varieties and relatives in the genus Oryza. Another approach, optical mapping, represented by the latest BioNano technology (Lam et al., 2012), offered significant advantages by revealing enzyme recognition site patterns across long DNA molecules. BioNano enabled high resolution and high throughput using nanochannel chips and fluorescent labeling, and rapidly expanded its applications to the study of large complex genomes (Cao et al., 2014; Stankova et al., 2016). We also found that the BAC-based approach was sensitive in detecting small-size structural changes, which was complementary to BioNano (Supplemental Notes).
Using BioNano technology, we first built high-resolution optical maps for the genomes of Nipponbare and 93-11. We generated 71.4 and 64.7 Gb of BioNano single-molecule data for Nipponbare and 93-11, respectively (Supplemental Table 1). To ensure the quality of BioNano data for subsequent analyses, the single-molecule data were passed through a quality filter before they were used to assemble genome maps (termed BioNano maps) with stringent parameters (Supplemental Methods). For Nipponbare, a consensus genome map of 377.4 Mb was generated with 482 contigs and an N50 of 1.1 Mb (Supplemental Table 1 and Supplemental Dataset 1). It matched IRGSP1.0 at ∼96.0% (Supplemental Table 2). Similarly, a consensus genome map for 93-11 of 393.5 Mb was assembled with 394 contigs and an N50 of 1.4 Mb (Supplemental Table 1 and Supplemental Dataset 2). However, it had a much lower mapping rate (85.2%) (Supplemental Table 2) to the 93-11 reference, 93-11v2 (Gao et al., 2013).
Given the differential mapping rates, we further compared the strings of labeling motifs in BioNano maps against corresponding genome regions and investigated discrepancies from diverging signals that represent likely assembly defects (Supplemental Methods). In IRGSP1.0, a catalog of 262 discrepancies with a total size of ∼8.78 Mb was found. These were categorized into four groups: insertions (I), deletions (II), inversions or translocations (III), and other complex discordances (IV) marked by restriction patterns distinctly different from BioNano maps (Figure 1A, left, and Supplemental Dataset 3). Chromosomes 1 and 4 had the most discrepancies (38 and 37, respectively), whereas chromosomes 3 and 6 had the fewest. To trace how the Nipponbare reference evolved, the older version of the Nipponbare reference, IRGSPv5 (Matsumoto et al., 2005), was also analyzed. A total of 269 discrepancies with a total size of ∼11.04 Mb were identified (Supplemental Dataset 3). Among them, five were corrected in IRGSP1.0. For example, BioNano contig29 aligned perfectly with IRGSP1.0 in region chr10:7961923-8042411, but diverged from IRGSPv5 in region chr10:7894177-7974664 (Figure 1B, upper left). Unexpectedly, we found newly introduced assembly errors in IRGSP1.0 that did not exist in IRGSPv5 (Figure 1B, lower left).
Compared with IRGSP1.0, 93-11v2 had a greater number of discrepancies with its BioNano map: 3419 were identified (∼105.17 Mb total region size) and categorized into the same four groups as defined for IRGSP1.0 (Figure 1A, right, and Supplemental Dataset 4). Chromosomes 1 and 3 had the most discrepancies (428 and 357, respectively), whereas chromosomes 9 and 12 had the fewest.
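The four discrepancy groups described above can be illustrated with a minimal sketch. It is not the authors' procedure (that is given in the Supplemental Methods); it assumes that each candidate region has already been reduced to the distance between two flanking aligned labels on the BioNano map and on the reference, plus two flags. The names `Segment` and `classify`, the thresholds, and the sign convention (group I meaning extra sequence in the reference assembly) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One candidate region bounded by two labels aligned in both maps (hypothetical layout)."""
    map_len: int            # bp between the flanking labels on the BioNano map
    ref_len: int            # bp between the same labels on the reference sequence
    same_orientation: bool  # False if the reference segment aligns reversed or elsewhere
    labels_match: bool      # True if the interior label pattern agrees


def classify(seg: Segment, tol: float = 0.1, min_diff: int = 5000) -> str:
    """Assign a segment to group I-IV, or 'concordant'.

    Assumed convention: I = reference longer than the map (insertion in the
    assembly), II = reference shorter than the map (deletion in the assembly).
    """
    if not seg.same_orientation:
        return "III"  # inversion or translocation
    diff = seg.ref_len - seg.map_len
    # Ignore size differences below an absolute and a relative threshold.
    threshold = max(min_diff, tol * seg.map_len)
    if diff > threshold:
        return "I"
    if -diff > threshold:
        return "II"
    if not seg.labels_match:
        return "IV"  # similar size, but a different label pattern
    return "concordant"


if __name__ == "__main__":
    example = Segment(map_len=100_000, ref_len=130_000,
                      same_orientation=True, labels_match=True)
    print(classify(example))
```

Orientation is checked first because a reversed or relocated segment makes a size comparison meaningless; a real pipeline would derive these fields from the map-to-reference alignment rather than take them as inputs.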
Further analysis of the old version of the 93-11 reference, 93-11v1 (Yu et al., 2006), with the 93-11 BioNano map identified 2566 discrepancies with a total size of ∼106.77 Mb (Supplemental Dataset 4). Among them, 333 (∼13.51 Mb) were corrected in 93-11v2. For example, BioNano contig4 aligned well with 93-11v2 in region chr02:31626328-31669232, but diverged from 93-11v1 in region chr02:33054748-33127908 because of an erroneous insertion (Figure 1B, upper right). Surprisingly, 463 discrepancies (∼13.16 Mb) that did not exist in 93-11v1 were newly introduced into 93-11v2. One example pointed to a translocation in 93-11v2 (Figure 1B, lower right). With the newly available Zhenshan 97 and Minghui 63 genomes (Zhang et al., 2016), we found that 1728 and 1710 of the 93-11v2 discrepancy regions had good matches between the 93-11 BioNano map and the genomes of Zhenshan 97 and Minghui 63, respectively (Supplemental Notes), pointing to common genomic elements among indica rice and affirming the accuracy of the BioNano map.
Looking back at both the Nipponbare and 93-11 genomes, our results indicated that although their quality at the nucleotide sequence level greatly improved with the advance of sequencing technology, progress was limited on issues associated with structural complexity in rice genomes.
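The version-to-version bookkeeping above (discrepancies corrected in the newer release versus newly introduced ones) can be sketched as an interval comparison. This is an illustrative assumption rather than the published method: it supposes that the discrepancies of both reference versions are expressed in the coordinates of the same BioNano map contigs, and that two records describe the same defect when their intervals overlap. The helper names `overlaps`, `index_by_contig`, and `unmatched` are hypothetical.

```python
from collections import defaultdict

# A discrepancy is (contig_id, start, end) in BioNano map coordinates,
# with half-open intervals [start, end). This layout is an assumption.


def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) intersect."""
    return a[0] < b[1] and b[0] < a[1]


def index_by_contig(regions):
    """Group (start, end) intervals by contig for quick lookup."""
    idx = defaultdict(list)
    for contig, start, end in regions:
        idx[contig].append((start, end))
    return idx


def unmatched(query, other):
    """Return the records in `query` that overlap no record in `other`."""
    idx = index_by_contig(other)
    return [r for r in query
            if not any(overlaps(r[1:], o) for o in idx[r[0]])]


if __name__ == "__main__":
    old_version = [("contig1", 100, 200), ("contig1", 500, 600), ("contig2", 0, 50)]
    new_version = [("contig1", 150, 250), ("contig2", 300, 400)]
    corrected = unmatched(old_version, new_version)   # present in old only
    introduced = unmatched(new_version, old_version)  # present in new only
    print(corrected)
    print(introduced)
```

Anchoring both comparisons on the map avoids having to lift coordinates between reference versions, which shift whenever sequence is added or removed.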
We next focused on three specific types of sequence structural issues in rice genomes, namely complex sequence structure in gaps, highly repetitive sequences, and BAC assembly errors.
The rice reference genomes, particularly 93-11v2, have large gaps that often contain complex sequences. We analyzed 39 and 9399 large gaps in the Nipponbare and 93-11 genomes, respectively, with BioNano maps (Supplemental Dataset 5). In IRGSP1.0, a gap in chr05:27174270-27175270 was found to have a 40-kb-long inverted repeat, unraveled by the Nipponbare BioNano map (Figure 1C, top). In 93-11v2, an example was found on chr09, where a 544-kb insertion contained several 35-kb or 47-kb repeats, arranged alternately (Figure 1C, bottom). These complex sequences in gaps were supported by high coverage of single-molecule data.
Highly repetitive sequences are almost intractable to short-read sequencing technologies. As a result, assemblies are heavily biased against repeats and duplications. To map out the tandem repeats in rice genomes, we systematically scanned the BioNano genome maps (Supplemental Methods) and found that both rice genomes, IRGSP1.0 and 93-11v2, underestimated the frequencies of tandem repeats (Figure 1D and Supplemental Dataset 6). The peaks of 8.2 kb observed in Nipponbare and of 9.1 kb in 93-11 were likely repeats of different types of rDNA genes, concentrated on chromosome 9 in Nipponbare, and on chromosomes 9 and 10 in 93-11 (Supplemental Figure 1). Based on BioNano data, we estimated the total number of rDNA repeats to be ∼141 in Nipponbare and ∼215 in 93-11. Notably, BioNano maps also helped refine different types of repeats/duplications in rice genomes (Supplemental Figure 2).
Historically, genome assemblies relied on the BAC scaffolding approach, and BAC scaffolding errors could lead to sequence structural defects in rice genomes. One case was found in IRGSP1.0 in region Chr11:28547360-28644282, where BAC_3335 (AC120507) was missing (Figure 1E).
Based on BioNano map contig640, BAC_3335, which was constructed from 23 contigs forming four repeat units, is located between BAC_3334 and BAC_3336.
Lastly, we developed a bioinformatics pipeline, BASCGF (BioNano Assisted Sequence Correction and Gap Filling), to reconstruct the 93-11 reference genome, taking advantage of the robust BioNano map resource and the homologous sequences shared between the 93-11 and Nipponbare genomes. We reasoned that it was possible to correct the defects in the 93-11 genome by referencing the high-quality Nipponbare sequences via the BioNano genome maps. We analyzed the 3419 discrepancies (Supplemental Dataset 4) identified in 93-11v2 by comparing their BioNano map fragments with the corresponding regions in the Nipponbare genome. Remarkably, 873 of the 3419 were found to match between the 93-11 BioNano map and the Nipponbare genome, IRGSP1.0. An example is shown with the 93-11 BioNano map contig788, which aligned perfectly with IRGSP1.0 in chr12:1174639-1197895, but did not align with 93-11v2 (Figure 1F). Based on these results, BASCGF was designed to correct errors and fill in gaps in the 93-11 genome with the homologous Nipponbare sequences, which were readily validated and polished with Illumina sequencing data generated from 93-11 DNA (Supplemental Methods and Supplemental Figure 3). The reconstructed 93-11 genome, 93-11v2Plus, had a total size of 372.48 Mb (Supplemental Dataset 8). It had 12.93 Mb of sequence corrected, 5703 gaps closed, and 455 gaps partially filled (Supplemental Table 3), representing a significant improvement over 93-11v2. Changes were made to 864 genes: 28 genes were newly added in filled gaps, and 836 were removed from existing loci in 93-11v2 and placed in the undetermined category in 93-11v2Plus. The above example (Figure 1F) was fixed in 93-11v2Plus by closing six gaps and removing the 23.3 kb of erroneously anchored sequence.
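The correction and gap-filling step can be pictured as applying a set of validated patches to a target sequence. The sketch below is a toy illustration only and does not reproduce BASCGF, whose actual steps are described in the Supplemental Methods. The `Patch` record, the `validated` flag (standing in for support from short-read data), and the function `apply_patches` are hypothetical.

```python
from typing import List, NamedTuple


class Patch(NamedTuple):
    """A proposed edit to the target sequence (hypothetical record layout)."""
    start: int       # 0-based inclusive start in the target
    end: int         # 0-based exclusive end in the target
    donor: str       # homologous sequence to write in ("" removes the span)
    validated: bool  # stand-in for "supported by short-read data"


def apply_patches(target: str, patches: List[Patch]) -> str:
    """Apply validated, non-overlapping patches to `target`.

    Patches are applied from right to left so that earlier coordinates
    are not shifted by length changes further downstream.
    """
    accepted = sorted((p for p in patches if p.validated),
                      key=lambda p: p.start, reverse=True)
    seq = target
    upper_bound = len(target)
    for p in accepted:
        if not (0 <= p.start <= p.end <= upper_bound):
            raise ValueError(f"overlapping or out-of-range patch: {p}")
        seq = seq[:p.start] + p.donor + seq[p.end:]
        upper_bound = p.start
    return seq


if __name__ == "__main__":
    target = "ACGTNNNNTTGACCCA"
    patches = [
        Patch(4, 8, "GGCATG", True),  # fill a gap (run of N) with donor sequence
        Patch(11, 14, "", True),      # remove an erroneously anchored span
        Patch(0, 2, "TT", False),     # not validated, so it is skipped
    ]
    print(apply_patches(target, patches))
```

Sorting by descending start keeps every patch's coordinates valid in the original target frame, which is why the overlap check compares each patch end against the start of the previously applied patch.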
In summary, using BioNano technology, we built high-resolution genome map resources for rice Nipponbare and 93-11. We demonstrated the utility of BioNano maps in elucidating complex sequence structure in gaps, resolving highly repetitive sequences, and correcting BAC assembly errors. An improved 93-11 reference, 93-11v2Plus, was reconstructed with a bioinformatics pipeline, BASCGF. These results validate the value of the BioNano map resource and our strategy for BioNano map-based analysis. Considering the strengths of different sequencing technologies, we suggest that the best-quality genome can be achieved by integrating BioNano mapping with PacBio and Illumina sequencing, i.e., building longer contigs with PacBio long reads, improving scaffolding/assembly with the BioNano map, and resolving sequence ambiguity/variation at single-nucleotide resolution with Illumina short reads (Supplemental Notes). Our study also highlights that dissecting genome structural complexity should be considered a new priority in future genome studies.
This work is supported in part by grants from the National Key Basic Research Program in China (2013CB127005), the National Natural Science Foundation of China (nos. 31401128, 31571310), and the Ministry of Agriculture of China (2016ZX08010-002), and by the Special Fund for Strategic Pilot Technology of the Chinese Academy of Sciences (XDA08020104).