Abstract:

Motivation: The rapid development of next-generation sequencing technology provides an opportunity to study genome-wide DNA methylation at single-base resolution. However, depletion of unmethylated cytosines brings challenges for aligning bisulfite-converted sequencing reads to a large reference. Software tools for aligning methylation reads have not yet been comprehensively evaluated, especially for the widely used reduced representation bisulfite sequencing (RRBS), which involves enrichment for CpG islands (CGIs).

Results: We developed a dedicated simulator, RRBSsim, for benchmarking analysis of RRBS data. We performed an extensive comparison of seven mapping algorithms for methylation analysis in both real and simulated RRBS data. Eighteen lung tumors and matched adjacent tissues were sequenced using the RRBS protocol. Our empirical evaluation found that methylation results were less consistent between software tools for CpG sites with low sequencing depth, with medium methylation levels, or located on CGI shores or in gene bodies. These observations were further confirmed by simulations, which indicated that software tools generally had lower recall in detecting these vulnerable CpG sites and lower precision in estimating their methylation levels. Among the software tools tested, bwa-meth and BS-Seeker2 (bowtie2) are currently our preferred aligners for RRBS data in terms of recall, precision and speed. Existing aligners cannot efficiently handle moderately methylated CpG sites or CpG sites on CGI shores or in gene bodies. Methylation results from these vulnerable CpG sites should be interpreted with caution. Our study reveals several important features inherent in methylation data, and RRBSsim provides guidance to advance sequence-based methylation data analysis and methodological development.

Availability and implementation: RRBSsim is a simulator for benchmarking analysis of RRBS data and its source code is available at https://github.com/xwBio/RRBSsim or https://github.com/xwBio/Docker-RRBSsim.

Supplementary information: Supplementary data are available at Bioinformatics online.

SimuSCoP: Reliably Simulate Illumina Sequencing Data Based on Position and Context Dependent Profiles

SCSsim: an Integrated Tool for Simulating Single-Cell Genome Sequencing Data

pIRS: Profile-Based Illumina Pair-End Reads Simulator

IntSIM: an Integrated Simulator of Next-Generation Sequencing Data

A Comprehensive Evaluation of Alignment Software for Reduced Representation Bisulfite Sequencing Data

SimCH: simulation of single-cell RNA sequencing data by modeling cellular heterogeneity at gene expression level

SInC: an accurate and fast error-model based simulator for SNPs, Indels and CNVs coupled with a read generator for short-read sequence data

BSReadSim: a versatile and efficient simulator to generate realistic bisulfite sequencing reads

RealSeq2: a Software Integrated with UMI Identification, Error Correction, and Methylation Modifications Storing

MetaSMC: a Coalescent-Based Shotgun Sequence Simulator for Evolving Microbial Populations

simCAS: an embedding-based method for simulating single-cell chromatin accessibility sequencing data

scReadSim: a single-cell RNA-seq and ATAC-seq read simulator

Systematic Review of Next-Generation Sequencing Simulators: Computational Tools, Features and Perspectives

SVSR: A Program to Simulate Structural Variations and Generate Sequencing Reads for Multiple Platforms

DeepSimulator: A Deep Simulator for Nanopore Sequencing

NGSNGS: next-generation simulator for next-generation sequencing data

DeepSimulator1.5: a More Powerful, Quicker and Lighter Simulator for Nanopore Sequencing

Too many needles in this haystack: algorithms for the analysis of next generation sequence data

NPBSS: a New PacBio Sequencing Simulator for Generating the Continuous Long Reads with an Empirical Model

GENOMICON-Seq: A comprehensive tool for the simulation of mutations in amplicon and whole exome sequencing

Simpute: an Efficient Solution for Dense Genotypic Data