Abstract: Targeted amplicon sequencing is widely used in microbial ecology studies. However, sequencing artifacts and amplification biases are of great concern. To identify sources of these artifacts, a systematic analysis was performed using mock communities composed of 16S rRNA genes from 33 bacterial strains. Our results indicated that while sequencing errors were generally isolated to low-abundance operational taxonomic units, chimeric sequences were a major source of artifacts. Singleton and doubleton sequences were primarily chimeras. Formation of chimeric sequences was significantly correlated with the GC content of the targeted sequences. Low-GC-content mock community members exhibited lower rates of chimeric sequence formation. GC content also had a large impact on sequence recovery. The quantitative capacity was notably limited, with substantial recovery variations and weak correlation between anticipated and observed strain abundances. The mock community strains with higher GC content had higher recovery rates than strains with lower GC content. Amplification bias was also observed due to differences in primer affinity. A two-step PCR strategy reduced the number of chimeric sequences by half. In addition, comparative analyses based on the mock communities showed that several widely used sequence processing pipelines/methods, including DADA2, Deblur, UCLUST, UNOISE, and UPARSE, had different advantages and disadvantages in artifact removal and rare species detection. These results are important for improving sequencing quality and reliability and developing new algorithms to process targeted amplicon sequences.

Importance: Amplicon sequencing of targeted genes is the predominant approach to estimate the membership and structure of microbial communities. However, accurate reconstruction of community composition is difficult due to sequencing errors and other methodological biases, and effective approaches to overcome these challenges are essential. Using a mock community of 33 phylogenetically diverse strains, this study evaluated the effect of GC content on sequencing results and tested different approaches to improve overall sequencing accuracy while characterizing the pros and cons of popular amplicon sequence data processing approaches. The sequencing results from this study can serve as a benchmarking data set for future algorithmic improvements. Furthermore, the new insights on sequencing error, chimera formation, and GC bias from this study will help enhance the quality of amplicon sequencing studies and support the development of new data analysis approaches.

Effects of Error, Chimera, Bias, and GC Content on the Accuracy of Amplicon Sequencing.
