Abstract: Targeted amplicon sequencing is widely used in microbial ecology studies. However, sequencing artifacts and amplification biases are of great concern. To identify the sources of these artifacts, a systematic analysis was performed using mock communities composed of 16S rRNA genes from 33 bacterial strains. Our results indicated that while sequencing errors were generally confined to low-abundance operational taxonomic units, chimeric sequences were a major source of artifacts. Singleton and doubleton sequences were primarily chimeras. Formation of chimeric sequences was significantly correlated with the GC content of the targeted sequences: low-GC-content mock community members exhibited lower rates of chimeric sequence formation. GC content also had a large impact on sequence recovery. The quantitative capacity was notably limited, with substantial variation in recovery and a weak correlation between anticipated and observed strain abundances. Mock community strains with higher GC content had higher recovery rates than strains with lower GC content. Amplification bias due to differences in primer affinity was also observed. A two-step PCR strategy reduced the number of chimeric sequences by half. In addition, comparative analyses based on the mock communities showed that several widely used sequence processing pipelines and methods, including DADA2, Deblur, UCLUST, UNOISE, and UPARSE, had different advantages and disadvantages in artifact removal and rare species detection. These results are important for improving sequencing quality and reliability and for developing new algorithms to process targeted amplicon sequences.

Importance: Amplicon sequencing of targeted genes is the predominant approach to estimate the membership and structure of microbial communities. However, accurate reconstruction of community composition is difficult due to sequencing errors and other methodological biases, and effective approaches to overcome these challenges are essential. Using a mock community of 33 phylogenetically diverse strains, this study evaluated the effect of GC content on sequencing results and tested different approaches to improve overall sequencing accuracy, while characterizing the pros and cons of popular amplicon sequence data processing approaches. The sequencing results from this study can serve as a benchmarking data set for future algorithmic improvements. Furthermore, the new insights on sequencing error, chimera formation, and GC bias from this study will help enhance the quality of amplicon sequencing studies and support the development of new data analysis approaches.
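The GC-content and recovery comparisons summarized above can be illustrated with a short sketch. This is not the authors' analysis code. It is a minimal example, assuming that each mock community strain has a reference amplicon sequence, an anticipated abundance, and an observed read count. The function names (`gc_content`, `recovery_ratios`, `pearson`) and the toy strains and counts are all hypothetical and chosen only for illustration.

```python
from math import sqrt


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a nucleotide sequence (case-insensitive)."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def recovery_ratios(expected: dict, observed: dict) -> dict:
    """Observed relative abundance divided by expected relative abundance.

    A ratio above 1 means the strain is over-recovered; below 1, under-recovered.
    Strains missing from `observed` are treated as having zero reads.
    """
    total_expected = sum(expected.values())
    total_observed = sum(observed.values())
    return {
        strain: (observed.get(strain, 0) / total_observed)
        / (expected[strain] / total_expected)
        for strain in expected
    }


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical toy data: three made-up strains, NOT values from the study.
    amplicons = {
        "strain_A": "ATATTAGCATTA",
        "strain_B": "ATGCGCATGCTA",
        "strain_C": "GGCCGCGGCACG",
    }
    expected = {"strain_A": 1, "strain_B": 1, "strain_C": 1}  # even mock community
    observed = {"strain_A": 200, "strain_B": 350, "strain_C": 450}  # read counts

    strains = sorted(amplicons)
    gc = [gc_content(amplicons[s]) for s in strains]
    ratios = recovery_ratios(expected, observed)
    r = pearson(gc, [ratios[s] for s in strains])
    for s, g in zip(strains, gc):
        print(f"{s}: GC={g:.2f} recovery={ratios[s]:.2f}")
    print(f"Pearson r between GC content and recovery: {r:.2f}")
```

With real data, the same comparison would be made across all strains of the mock community. A rank-based correlation may be preferable when recovery ratios are strongly skewed.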
