Whole-Genome Sequencing and Analysis of the Chinese Herbal Plant Panax notoginseng
Wei Chen, Ling Kui, Guanghui Zhang, Shusheng Zhu, Jing Zhang, Xiao Wang, Min Yang, Huichuan Huang, Yixiang Liu, Yong Wang, Yahe Li, Lipin Zeng, Wen Wang, Xiahong He, Yang Dong, Shengchao Yang
DOI: https://doi.org/10.1016/j.molp.2017.02.010
IF: 27.5
2017-01-01
Molecular Plant
Abstract: Panax notoginseng (Burk.) F.H. Chen (2n = 2x = 24, common name sanqi or tianqi), a member of the Araliaceae family, is a slow-growing plant species documented in the ancient Chinese medical literature for its ability to ameliorate hemostasis and improve blood circulation (Wang et al., 2016). After decades of pharmacological research, a variety of P. notoginseng-specific secondary metabolites (notably ginsenosides, notoginsenosides, and gypenosides) have been isolated, identified, and implicated in conferring medicinal properties (Wang et al., 2016). These discoveries allowed the design and production of numerous modern oral and topical drugs to treat cardiovascular diseases, contusions, and soft tissue pain (Wang et al., 2016). To help identify novel bioactive compounds in P. notoginseng and delineate their biosynthetic pathways, we propose the construction of a reference P. notoginseng genome. This information will be an important addition to the existing expressed sequence tag (Luo et al., 2011) and RNA-seq data (Liu et al., 2015) for P. notoginseng genetics.
Members of the Panax genus usually have large and highly heterozygous genomes. For instance, the genome sizes of the tetraploid P. ginseng and P. quinquefolius are 3.12 Gb and 4.91 Gb, respectively (Choi et al., 2014). In this study, we first estimated the genome size of the diploid P. notoginseng to be about 2.31 Gb by flow cytometry analysis. This relatively large size prompted us to construct 34 Illumina paired-end libraries for whole-genome sequencing (see Supplemental Information for detailed methods; Supplemental Table 1). In total, about 1837.6 Gb of raw data were generated on two Illumina platforms, representing about 795-fold coverage of the P. notoginseng genome. After removing low-quality and duplicated reads, about 858.6 Gb of clean data were obtained for the de novo assembly of the P. notoginseng genome. The de novo assembly process yielded a draft P. notoginseng genome of 2.39 Gb, with a contig N50 size of 16 kb and a scaffold N50 size of 96 kb (Supplemental Table 2).
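The coverage figure and the N50 statistic quoted above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the authors' pipeline: the function names and the toy contig lengths are hypothetical, and the clean-data depth is a derived value that the text itself does not state.

```python
def fold_coverage(total_bases_gb, genome_size_gb):
    """Sequencing depth: total sequenced bases divided by genome size."""
    return total_bases_gb / genome_size_gb


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Figures taken from the text: 1837.6 Gb raw data, 858.6 Gb clean data,
# and a flow-cytometry genome size estimate of 2.31 Gb.
raw_depth = fold_coverage(1837.6, 2.31)    # about 795-fold, as reported
clean_depth = fold_coverage(858.6, 2.31)   # derived here, not stated in the text

# Hypothetical toy contig lengths (not from the paper) to illustrate N50.
toy_contigs = [80, 70, 50, 40, 30, 20]
toy_n50 = n50(toy_contigs)                 # 80 + 70 = 150 >= 290 / 2, so N50 is 70

print(f"raw depth: {raw_depth:.1f}x, clean depth: {clean_depth:.1f}x, toy N50: {toy_n50}")
```

An N50 is only comparable between assemblies of similar total size, which is why the contig and scaffold values are reported together with the 2.39 Gb assembly length.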
Evaluation of the completeness of this genome assembly by the core eukaryotic genes mapping approach (CEGMA) revealed that 198 of 248 ultra-conserved genes could be fully annotated (80% completeness; see Supplemental Table 3), and 239 of 248 ultra-conserved genes met the criterion for partial annotation (96% completeness). We also assessed the completeness of the P. notoginseng genome and annotation with common plant benchmarking universal single-copy orthologs (BUSCOs). The results showed that 1186 of 1440 plant BUSCOs (82.4%) could be found in this genome assembly, and 47 plant BUSCOs (3.3%) had fragmented matches (Supplemental Figure 1).
Analysis of the P. notoginseng genome using Tandem Repeat Finder identified about 127.6 Mb of tandem repeats, accounting for 5.32% of the assembled genome. In comparison, the transposable element annotation revealed about 1.71 Gb of repeat sequences, accounting for about 75.94% of the assembled genome (Supplemental Table 4). Among all transposable element families, long terminal repeats (LTRs), which are important determinants of angiosperm genome size variation (Bennetzen and Wang, 2014), made up about 66.72% of the total sequence. A similar phenomenon was also observed in P. ginseng, in which the five most abundant LTR subfamilies comprised 33% of its genome (Choi et al., 2014). In this regard, the P. notoginseng genome provides an excellent model for studying the amplification history of the LTR families in Panax plants.
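The completeness percentages reported for CEGMA and BUSCO follow directly from the gene counts. A minimal sketch of that arithmetic is given below; the helper name `percent` is hypothetical, and the counts are the ones quoted in the text.

```python
def percent(part, whole, digits=1):
    """Share of `part` in `whole`, as a percentage rounded to `digits` decimals."""
    return round(100.0 * part / whole, digits)


# CEGMA: 248 ultra-conserved core eukaryotic genes.
cegma_full = percent(198, 248)        # reported as 80% completeness
cegma_partial = percent(239, 248)     # reported as 96% completeness

# BUSCO: 1440 plant single-copy orthologs.
busco_found = percent(1186, 1440)     # reported as 82.4%
busco_fragmented = percent(47, 1440)  # reported as 3.3%

for label, value in [
    ("CEGMA full", cegma_full),
    ("CEGMA partial", cegma_partial),
    ("BUSCO found", busco_found),
    ("BUSCO fragmented", busco_fragmented),
]:
    print(f"{label}: {value}%")
```

The text rounds the CEGMA values to whole percentages and the BUSCO values to one decimal place; the sketch keeps one decimal throughout.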
To facilitate the protein-coding gene annotation process, we obtained RNA-seq data and the corresponding de novo transcriptome assemblies from the fruit, leaf, flower, stem, primary root, and secondary root samples of a single P. notoginseng plant (Supplemental Table 5). A combination of de novo, homology-based, and transcriptome-based predictions yielded 36 790 protein-coding genes in the P. notoginseng genome (Supplemental Table 6). The average mRNA length of the protein-coding genes was 3307 bp. In addition, we obtained 8446 copies of non-protein-coding genes (miRNA, tRNA, rRNA, and snRNA) in the P. notoginseng genome, which constituted about 0.044% of the total sequence (Supplemental Table 7).
Ortholog clustering analysis and gene family clustering analysis were performed using OrthoMCL on all the protein-coding genes of P. notoginseng, Arabidopsis thaliana, Amborella trichopoda, Capsicum annuum, Carica papaya, Cucumis sativus, Malus domestica, Oryza sativa, Populus trichocarpa, Solanum tuberosum, and Vitis vinifera. In P. notoginseng, the 36 790 protein-coding genes comprise 3181 single-copy orthologs, 7818 multiple-copy orthologs, 5843 unique paralogs, 9898 other paralogs, and 10 050 unclustered genes (Figure 1A). A total of 26 740 protein-coding genes can be clustered into 14 027 gene families, of which 1727 are unique gene families (Supplemental Table 8). In addition, gene family evolution analysis of the above-mentioned plants revealed that 1423 gene families in P. notoginseng underwent expansion, whereas 3231 gene families underwent contraction (Figure 1B). Phylogenetic analysis showed that P. notoginseng diverged from members of the Solanaceae family, S. tuberosum and C. annuum, about 91.2 million years ago (Figure 1B). Because the genomes of other Panax plants are not yet available, we constructed a phylogenetic tree of P. ginseng, P. notoginseng, and P. quinquefolius using their root transcriptomes. The result showed that the diploid P. notoginseng diverged from the other two tetraploid Panax plants before a putative whole-genome duplication event in their lineage (Supplemental Figure 2A). In addition, P. notoginseng shared 9383 transcripts with the other two Panax plants and contained 976 unique transcripts in the root transcriptome (Supplemental Figure 2B).
Plant terpenes have been an important class of natural products for pharmaceutical screening and design (Tholl, 2006). Despite their diverse chemical structures, these compounds are derived from two five-carbon isomeric basic building blocks: isopentenyl diphosphate (IPP) and dimethylallyl diphosphate (DMAPP) (Trapp and Croteau, 2001). In plants, the de novo production of IPP and DMAPP involves the classic acetate/mevalonate pathway in the cytosol and the pyruvate/glyceraldehyde-3-phosphate pathway in the plastid (Tholl, 2006). The subsequent condensation of IPP and DMAPP in various combinations gives rise to different intermediate precursors (e.g., geranyl diphosphate) for the biosynthesis of plant terpenes (Tholl, 2006). Here, in the P. notoginseng genome, we identified almost all the homologous genes for the enzymes involved in the biosynthesis of IPP, DMAPP, and various intermediate precursors (Supplemental Table 9).
Notably, most of these homologous genes could be supported by the P. notoginseng transcriptome data from six plant organs.
Depending on the number of IPP and DMAPP (C5) units used for their synthesis, terpenes can be categorized into monoterpenes (C10), sesquiterpenes (C15), diterpenes (C20), triterpenes (C30), and so on (Chen et al., 2011). The key enzymes for producing these compounds from the building blocks and intermediate precursors are collectively called terpene synthases (TPSs). By applying a hidden Markov model-based homologous gene search method (Chen et al., 2011), we identified 30 putative TPS genes in the P. notoginseng genome (Supplemental Table 10). The number of predicted P. notoginseng TPS genes is comparable to that found in A. thaliana, but much lower than those found in grape, rice, poplar, and sorghum (Supplemental Table 10). Among these 30 putative TPS genes, 12 encode protein products larger than 500 amino acids (Figure 1C), which may represent full-length TPS proteins (Chen et al., 2011).
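The C5-unit classification and the 500-amino-acid length cut-off described above can be illustrated with a short sketch. This is only an illustration under stated assumptions: the function names, gene identifiers, and sequences are hypothetical placeholders, and the code does not reproduce the hidden Markov model search used to find the TPS candidates.

```python
# Terpene classes named in the text, keyed by the number of C5 (IPP/DMAPP) units.
TERPENE_CLASSES = {
    2: "monoterpene",    # C10
    3: "sesquiterpene",  # C15
    4: "diterpene",      # C20
    6: "triterpene",     # C30
}


def terpene_class(n_c5_units):
    """Return (carbon count, class name) for a terpene built from n C5 units."""
    carbons = 5 * n_c5_units
    return carbons, TERPENE_CLASSES.get(n_c5_units, "other")


def likely_full_length(proteins, min_length=500):
    """Keep predicted proteins longer than min_length amino acids."""
    return {gene_id: seq for gene_id, seq in proteins.items() if len(seq) > min_length}


# Hypothetical placeholder predictions (IDs and sequences are not from the paper).
toy_proteins = {
    "toy_tps_1": "M" * 612,  # above the cut-off
    "toy_tps_2": "M" * 240,  # likely a fragment
    "toy_tps_3": "M" * 500,  # exactly 500 aa, so not "larger than 500"
}

kept = likely_full_length(toy_proteins)
print(terpene_class(3))  # (15, 'sesquiterpene')
print(sorted(kept))      # ['toy_tps_1']
```

A strict "larger than 500" comparison mirrors the wording in the text; whether the authors treated a protein of exactly 500 residues as full length is not stated.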
Genomic DNA analysis of these 12 TPS genes revealed a well-conserved intron/exon organization pattern according to the alignment of functional domains and matching intron phase numbers (Figure 1C). This result is in accordance with the proposal that the majority of TPS genes share a common origin (Trapp and Croteau, 2001; Chen et al., 2011). In particular, the evolution of many TPS genes is characterized by the loss of introns and of the conifer diterpene internal sequence (CDIS) domain (Trapp and Croteau, 2001). Indeed, we showed in this analysis that only two of the 12 putative TPS genes, Ppse_ynau_021314 and Ppse_ynau_029077, contain the CDIS domain and three additional introns in the glycosyl hydrolase-like domain. Further phylogenetic reconstruction of the putative TPS proteins (>500 aa) from P. notoginseng and eight other plants (Chen et al., 2011) showed that ten of the 12 P. notoginseng TPSs belonged to the TPS-a1 and TPS-b subfamilies (Figure 1D), indicating their possible functions as monoterpene or sesquiterpene synthases (Chen et al., 2011). In addition, correlation analyses of the expression levels of these 12 TPS genes showed that certain TPS genes have similar expression patterns across six plant organs (Figure 1E), and that certain plant organs have similar TPS expression patterns (Figure 1F).
In conclusion, we have sequenced, assembled, and annotated the highly repetitive and complex P. notoginseng genome. With the support of RNA-seq data from six organs of P. notoginseng, we presented the genomic analysis of the IPP/DMAPP biosynthetic pathways and TPS genes. This information, together with the predicted GLYCOSYLTRANSFERASE genes (GT genes; Supplemental Table 11), not only lays the groundwork for studying the biosynthesis of known terpenoids in P. notoginseng but also provides ample genetic resources for identifying novel drug candidates in closely related Panax species.
This work was supported by the Major Science and Technique Programs in Yunnan Province (no. 2016ZF001), the National Natural Science Foundation of China (no. U1402262), and the Pilot Project for Establishing Social Service System through Agricultural Science and Education in Yunnan Province, Medical Plant Unit (no. 2014NG003).
References:
Bennetzen J.L., Wang H. (2014). The contributions of transposable elements to the structure, function, and evolution of plant genomes. Annu. Rev. Plant Biol. 65: 19.1-19.26.
Chen F., Tholl D., Bohlmann J., Pichersky E. (2011). The family of terpene synthases in plants: a mid-size family of genes for specialized metabolism that is highly diversified throughout the kingdom. Plant J. 66: 212-229.
Choi H., Waminal N.E., Park H.M., Kim N.H., Choi B.S., Park M., Choi D., Lim Y.P., Kwon S., Park B., et al. (2014). Major repeat components covering one-third of the ginseng (Panax ginseng C.A. Meyer) genome and evidence for allotetraploidy. Plant J. 77: 906-916.
Liu M.H., Yang B.R., Cheung W.F., Yang K.Y., Zhou H.F., Kwok J.S., Liu G.C., Li X.F., Zhong S., Lee S.M., et al. (2015). Transcriptome analysis of leaves, roots and flowers of Panax notoginseng identifies genes involved in ginsenoside and alkaloid biosynthesis. BMC Genomics 16: 265.
Luo H., Sun C., Sun Y., Wu Q., Li Y., Song J., Niu Y., Cheng X., Xu H., Li C., et al. (2011). Analysis of the transcriptome of Panax notoginseng root uncovers putative triterpene saponin-biosynthetic genes and genetic markers. BMC Genomics 12: S5.
Tholl D. (2006). Terpene synthases and the regulation, diversity and biological roles of terpene metabolism. Curr. Opin. Plant Biol. 9: 1-8.
Trapp S.C., Croteau R.B. (2001). Genomic organization of plant terpene synthases and molecular evolutionary implications. Genetics 158: 811-832.
Wang T., Guo R., Zhou G., Zhou X., Kou Z., Sui F., Li C., Tang L., Wang Z. (2016). Traditional uses, botany, phytochemistry, pharmacology and toxicology of Panax notoginseng (Burk.) F. H. Chen: a review. J. Ethnopharmacol. 188: 234-258.