Abstract
Background Accurate variant detection in the coding regions of the human genome is a key requirement for molecular diagnostics of Mendelian disorders. Efficiency of variant discovery from next-generation sequencing (NGS) data depends on multiple factors, including reproducible coverage biases of NGS methods and the performance of read alignment and variant calling software. Although variant caller benchmarks are published constantly, no previous publications have leveraged the full extent of available gold standard whole-genome (WGS) and whole-exome (WES) sequencing datasets.
Results In this work, we systematically evaluated the performance of 4 popular short read aligners (Bowtie2, BWA, Isaac, and Novoalign) and 9 novel and well-established variant calling and filtering methods (Clair3, DeepVariant, Octopus, GATK, FreeBayes, and Strelka2) using a set of 14 “gold standard” WES and WGS datasets available from the Genome In A Bottle (GIAB) consortium. Additionally, we indirectly evaluated each pipeline’s performance using a set of 6 non-GIAB samples of African and Russian ethnicity. In our benchmark, Bowtie2 performed significantly worse than the other aligners, suggesting it should not be used for medical variant calling. When the other aligners were considered, the accuracy of variant discovery depended mostly on the variant caller and not on the read aligner. Among the tested variant callers, DeepVariant consistently showed the best performance and the highest robustness. Other actively developed tools, such as Clair3, Octopus, and Strelka2, also performed well, although their efficiency depended more strongly on the quality and type of the input data. We also compared the consistency of variant calls in GIAB and non-GIAB samples. With a few important caveats, the best-performing tools showed little evidence of overfitting.
Conclusions The results show surprisingly large differences in the performance of cutting-edge tools, even in high-confidence regions of the coding genome. This highlights the importance of regular benchmarking of quickly evolving tools and pipelines. We also discuss the need for a more diverse set of gold standard genomes that would include samples of African, Hispanic, or mixed ancestry. There is also a need for better variant caller assessment in the repetitive regions of the coding genome.
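
To make the idea of benchmarking against a gold standard concrete, the following is a minimal sketch of how a call set might be scored against a truth set. The abstract does not state which metrics or comparison tools were used, so the precision, recall, and F1 metrics here are an assumption, and the names `Variant`, `Metrics`, and `compare_calls` are hypothetical. Exact matching on chromosome, position, and alleles is a simplification; real benchmarks typically rely on dedicated comparison tools that handle differing representations of the same variant.

```python
from typing import NamedTuple, Set, Tuple

# Hypothetical variant key: (chromosome, 1-based position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


class Metrics(NamedTuple):
    true_positives: int
    false_positives: int
    false_negatives: int
    precision: float
    recall: float
    f1: float


def compare_calls(truth: Set[Variant], calls: Set[Variant]) -> Metrics:
    """Score a call set against a truth set using exact-match comparison.

    Assumption: both sets are already restricted to the same
    high-confidence regions and use the same variant normalization.
    """
    tp = len(truth & calls)   # called and present in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # present in the truth set but not called
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Metrics(tp, fp, fn, precision, recall, f1)


# Toy data for illustration only; not taken from any real sample.
truth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 75, "G", "GA"),
    ("chr2", 300, "T", "C"),
}
calls = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 75, "G", "GA"),
    ("chr3", 10, "A", "C"),
}

metrics = compare_calls(truth, calls)
print(metrics)
```

Running the same function once per aligner and caller combination would give a table of scores that can be compared across pipelines, which is the general shape of the comparison the abstract describes.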

Systematic benchmark of state-of-the-art variant calling pipelines identifies major factors affecting accuracy of coding sequence variant discovery