Abstract: Metagenomics has greatly revolutionized the study of microbial communities over the last decade. However, artificial duplicate reads, which arise mainly during the preparation of metagenomic DNA sequencing libraries, and their impacts on metagenomic assembly and binning have never been brought to attention. Here, we explicitly investigated the effects of duplicate reads on metagenomic assembly and binning, based on analyses of four groups of representative metagenomes with distinct microbiome complexity. Our results showed that deduplication considerably increased the binning yields (by 3.5% to 80%) for most of the metagenomic datasets examined, thanks to improved contig length and coverage profiling of metagenome-assembled contigs. Specifically, 411 versus 397, 331 versus 317, 104 versus 88, and 9 versus 5 metagenome-assembled genomes (MAGs) were recovered, with versus without deduplication, from MEGAHIT assemblies of bioreactor sludge, surface water, lake sediment, and forest soil metagenomes, respectively. Notably, deduplication also reduced the computational costs of metagenomic assembly, including elapsed time (by 9.0% to 29.9%) and maximum memory requirement (by 4.3% to 37.1%). Taken together, we recommend removing duplicate reads from metagenomic data before assembly and binning analyses, particularly for complex environmental samples such as the forest soils examined in this study.

Importance: Duplicated reads are usually considered technical artefacts. Their presence in metagenomes would theoretically not only introduce bias into quantitative analysis but also distort coverage profiles, leading to negative effects on, or even failures of, metagenomic assembly and binning, because the widely used metagenome assemblers and binners all need coverage information for graph partitioning and assembly binning, respectively. However, this issue has seldom been noticed, and its impacts on key downstream bioinformatic procedures (e.g., assembly and binning) have remained unclear. In this study, we comprehensively evaluated, for the first time, the impacts of duplicate reads on de novo assembly and binning of real metagenomic datasets by comparing assembly quality, binning yields, and computational resource requirements with and without the removal of duplicate reads. We found that deduplication considerably increased the binning yields and significantly reduced the computational costs, including elapsed time and maximum memory requirement. The results provide an empirical reference for more cost-efficient metagenomic analyses in microbiome research.
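
The abstract does not say which deduplication tool or duplicate definition was used, so the following is only a minimal sketch of the general idea. It assumes a duplicate is a read pair whose forward and reverse sequences are both identical to a pair already seen. Dedicated tools may additionally handle reverse complements or tolerate mismatches. The names `parse_fastq` and `dedup_pairs`, and the toy reads, are hypothetical and used for illustration only.

```python
import hashlib
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str, str]  # (header, sequence, quality)


def parse_fastq(lines: Iterable[str]) -> Iterator[Read]:
    """Yield (header, sequence, quality) from well-formed 4-line FASTQ records."""
    stripped = (ln.rstrip("\n") for ln in lines)
    # Group the line stream into consecutive chunks of four lines.
    for header, seq, _plus, qual in zip(*[iter(stripped)] * 4):
        yield header, seq, qual


def dedup_pairs(
    forward: Iterable[Read], reverse: Iterable[Read]
) -> Iterator[Tuple[Read, Read]]:
    """Keep the first occurrence of each exact (forward, reverse) sequence pair.

    Assumption: duplicates are exact sequence matches on both mates.
    """
    seen = set()
    for r1, r2 in zip(forward, reverse):
        # Store a fixed-size digest instead of the full sequences to save memory.
        key = hashlib.md5(f"{r1[1]}\t{r2[1]}".encode()).digest()
        if key in seen:
            continue
        seen.add(key)
        yield r1, r2


# Toy example (hypothetical reads): pair "b" duplicates pair "a" exactly,
# while pair "c" shares only the forward sequence and is therefore kept.
forward = [("@a/1", "ACGT", "IIII"), ("@b/1", "ACGT", "IIII"), ("@c/1", "ACGT", "IIII")]
reverse = [("@a/2", "TTTT", "IIII"), ("@b/2", "TTTT", "IIII"), ("@c/2", "GGGG", "IIII")]
kept = list(dedup_pairs(forward, reverse))
print(f"{len(kept)} of {len(forward)} pairs kept")
```

Keying on both mates rather than on a single read avoids discarding pairs that merely share one end, which is the more conservative choice when the downstream coverage profile matters.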

An Improved Filtering Algorithm for Big Read Datasets

KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies

Abstract 4956: A fast and efficient bioinformatics analysis workflow for processing reads from single-cell multiomics assays captured on a microwell-based platform

Next-generation data filtering in the genomics era

ntEdit: scalable genome sequence polishing

Efficient High-Quality Metagenome Assembly from Long Accurate Reads using Minimizer-space de Bruijn Graphs

Klumpy: A Tool to Evaluate the Integrity of Long-Read Genome Assemblies and Illusive Sequence Motifs

Dime: A Novel Framework for De Novo Metagenomic Sequence Assembly

A read-filtering algorithm for high-throughput marker-gene studies that greatly improves OTU accuracy

Assembling large, complex environmental metagenomes

DNAscan: a fast, computationally and memory efficient bioinformatics pipeline for the analysis of DNA next-generation-sequencing data

DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies

Evaluating long-read de novo assembly tools for eukaryotic genomes: insights and considerations

Error filtering, pair assembly and error correction for next-generation sequencing reads

SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing

New algorithms for accurate and efficient de novo genome assembly from long DNA sequencing reads

Deduplication Improves Cost-Efficiency and Yields of De novo Assembly and Binning of Shot-Gun Metagenomes in Microbiome Research

MiniScrub: de novo long read scrubbing using approximate alignment and deep learning

AGC: compact representation of assembled genomes with fast queries and updates

An open-sourced bioinformatic pipeline for the processing of Next-Generation Sequencing derived nucleotide reads: Identification and authentication of ancient metagenomic DNA