SOAPBarcode: Revealing Arthropod Biodiversity Through Assembly of Illumina Shotgun Sequences of PCR Amplicons

Shanlin Liu, Yiyuan Li, Jianliang Lu, Xu Su, Min Tang, Rui Zhang, Lili Zhou, Chengran Zhou, Qing Yang, Yinqiu Ji, Douglas W. Yu, Xin Zhou
DOI: https://doi.org/10.1111/2041-210x.12120
2013-01-01
Methods in Ecology and Evolution
Abstract: Metabarcoding of mixed arthropod samples for biodiversity assessment has mostly been carried out on the 454 GS FLX sequencer (Roche, Branford, Connecticut, USA), due to its ability to produce long reads (≥400 bp) that are believed to allow higher taxonomic resolution. The Illumina sequencing platforms, with their much higher throughputs, could potentially reduce sequencing costs and improve sequence quality, but the associated shorter read length (typically <150 bp) has deterred their usage in next‐generation‐sequencing (NGS)‐based analyses of eukaryotic biodiversity, which often utilize standard barcode markers (e.g. COI, rbcL, matK, ITS) that are hundreds of nucleotides long. We present a new Illumina‐based pipeline to recover full‐length COI barcodes from mixed arthropod samples. Our new assembly program, SOAPBarcode, a variant of the genome assembly program SOAPdenovo, uses paired‐end reads of the standard COI barcode region as anchors to extract the correct pathways (sequences) out of otherwise chaotic ‘de Bruijn graphs’, which are caused by the presence of large numbers of COI homologs of high sequence similarity. Two bulk insect samples of known species composition, previously analysed in a 454 metabarcoding study (Yu et al. 2012), are re‐analysed here with our pipeline. Compared to the results of Roche 454 (c. 400‐bp reads), our pipeline recovered full‐length COI barcodes (658 bp) and 17–31% more species‐level operational taxonomic units (OTUs) from bulk insect samples, with fewer untraceable (novel) OTUs. On the other hand, our PCR‐based pipeline also revealed higher rates of contamination across samples, due to Illumina's increased sequencing depth. On balance, the assembled full‐length barcodes and increased OTU recovery rates resulted in more resolved taxonomic assignments and more accurate beta diversity estimation.
The HiSeq 2000 and the SOAPBarcode pipeline together can achieve more accurate biodiversity assessment at a much reduced sequencing cost in metabarcoding analyses. However, greater precaution is needed to prevent cross‐sample contamination during field preparation and laboratory operation, because the deeper sequencing also detects non‐target DNA amplicons present at low copy numbers.
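The anchoring idea in the abstract can be illustrated with a toy example. The sketch below is not the SOAPBarcode algorithm; it only shows why a de Bruijn graph built from highly similar sequences becomes ambiguous, and how knowing both ends of a fragment can single out one path. The function names (`build_graph`, `extend_from`), the choice of k, and the two short sequences are all hypothetical and chosen for illustration.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Build a de Bruijn graph: each (k-1)-mer maps to the (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_from(graph, left_anchor, k, max_len):
    """Return every sequence reachable from left_anchor, stopping at dead ends or max_len."""
    finished = []
    stack = [left_anchor]
    while stack:
        seq = stack.pop()
        successors = graph.get(seq[-(k - 1):], ())
        if not successors or len(seq) >= max_len:
            finished.append(seq)
            continue
        for nxt in sorted(successors):
            # Each edge adds exactly one base to the growing sequence.
            stack.append(seq + nxt[-1])
    return sorted(finished)


# Two made-up "homologs" that share the core GTCAGT but differ at both ends.
hap_a = "ACCGGTCAGTTTGA"
hap_b = "TAGCGTCAGTACAA"
K = 4  # hypothetical k-mer size for this toy example

graph = build_graph([hap_a, hap_b], K)

# Walking from the left end of hap_a alone is ambiguous: the shared core
# lets the walk exit into either homolog's right end.
candidates = extend_from(graph, "ACCG", K, max_len=20)

# A paired-end read supplies both ends of the same molecule, so the
# right-hand anchor rules out the chimeric path.
anchored = [seq for seq in candidates if seq.endswith("TTGA")]
print(anchored)
```

In this toy case the unanchored walk yields two candidates, one of which is a chimera joining the start of one sequence to the end of the other; requiring the right-hand anchor leaves only the true sequence. Real data would additionally involve sequencing errors, coverage weighting, and far larger graphs, none of which are modelled here.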