Abstract: Background: High-throughput sequencing of environmental DNA (eDNA) has applications in biodiversity monitoring, taxon abundance estimation, understanding the dynamics of community ecology, and the study and conservation of marine species. Environmental DNA, especially marine eDNA, degrades rapidly. Alongside good-quality reads, the sequencing data can contain a significant number of reads that fall slightly below the default PHRED quality threshold of 30. For quality control, trimming methods are employed, generally before the read pairs are merged. However, in the case of eDNA, a significant percentage of reads within the acceptable quality score range are also dropped. Methods: To identify the merge tool best suited to eDNA, we used two HiSeq paired-end eDNA datasets to compare merging, without preprocessing, by FLASH (Fast Length Adjustment of SHort reads), PANDAseq, COPE, BBMerge, and VSEARCH. We assessed these tools on processing time, read quality, and the number of merged reads. Trimmomatic, a widely used preprocessing tool, was also assessed by preprocessing the datasets with different parameter settings for its two quality-trimming approaches: Sliding Window and Maximum Information. The preprocessed read pairs were then merged using the merge tool identified earlier. Results: FLASH was the most efficient merge tool, balancing data conservation, read quality, and processing time. We compared Trimmomatic's two quality-trimming options, at increasing strictness, with direct merging by FLASH. Raw reads processed with Trimmomatic and then merged showed a significant drop in read count compared to the direct merge. On average, 29% of reads were dropped when merged directly with FLASH. The Maximum Information option resulted in read losses ranging from 30.7% at the lowest stringency to 68.05% at the highest.
The Sliding Window approach conserved approximately 10% more reads when a PHRED score of 25 was set as the threshold for a window of size 4. The lowered PHRED cutoff conserved about 50% of the reads that could potentially be informative. We noted no significant reduction in data while optimizing the number of bases read in a window with the ideal quality (Q) score. Conclusions: Losing reads can negatively impact downstream processing of environmental data, especially in sequence alignment studies. The quality trim-first, merge-later approach can significantly decrease the number of reads conserved, whereas direct merging of paired-end reads using FLASH conserved more than 60% of the reads. Direct merging of the paired-end reads can therefore prevent the removal of potentially informative reads that do not comply with the trimming tool's strict checks. We found FLASH to be an efficient tool for conserving reads while carrying out quality trimming in moderation. Overall, our results show that merging paired-end reads of eDNA data before trimming can conserve more reads.
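
To illustrate why lowering the Sliding Window threshold from 30 to 25 can retain reads that sit slightly below the default cutoff, the following is a minimal Python sketch of window-based quality trimming. It is a simplified illustration and not Trimmomatic's actual implementation: the function names `phred_scores` and `sliding_window_trim`, the Phred+33 encoding, and the example quality string are all assumptions made for this sketch.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer PHRED scores.

    Assumes Phred+33 encoding (an assumption of this sketch).
    """
    return [ord(ch) - offset for ch in quality_string]


def sliding_window_trim(scores, window=4, threshold=25):
    """Return the number of leading bases kept after window-based trimming.

    Simplified rule (hypothetical, for illustration): scan windows from the
    5' end and cut the read at the start of the first window whose mean
    quality falls below `threshold`. Reads shorter than the window are
    returned untrimmed.
    """
    for start in range(len(scores) - window + 1):
        chunk = scores[start:start + window]
        if sum(chunk) / window < threshold:
            return start
    return len(scores)


# Hypothetical read: 8 bases at Q36 ('E') followed by 8 bases at Q27 ('<').
scores = phred_scores("EEEEEEEE" + "<<<<<<<<")

kept_q30 = sliding_window_trim(scores, window=4, threshold=30)
kept_q25 = sliding_window_trim(scores, window=4, threshold=25)

print(f"threshold 30 keeps {kept_q30} of {len(scores)} bases")  # 7 of 16
print(f"threshold 25 keeps {kept_q25} of {len(scores)} bases")  # 16 of 16
```

In this toy read, the Q27 tail is cut at a threshold of 30 but kept in full at a threshold of 25. A read shortened this way may no longer overlap its mate, which is one plausible route by which strict trimming before merging reduces the number of merged reads.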

Benefits of merging paired-end reads before pre-processing environmental metagenomics data
