Abstract: Hierarchical genotyping approaches can provide insights into the source, geography and temporal distribution of bacterial pathogens. Multiple hierarchical SNP genotyping schemes have previously been developed so that new isolates can rapidly be placed within pre-computed population structures, without the need to rebuild phylogenetic trees for the entire dataset. This classification approach has, however, seen limited uptake in routine public health settings because of its analytical complexity and the lack of standardized tools that make results clear and easy to interpret. BioHansel was developed as an organism-agnostic tool for hierarchical SNP-based genotyping. The tool identifies split k-mers that distinguish predefined lineages in whole genome sequencing (WGS) data using SNP-based genotyping schemes. BioHansel uses the Aho-Corasick algorithm to type isolates from assembled genomes or raw read sequence data in a matter of seconds, with limited computational resources. This makes BioHansel ideal for use by public health agencies that rely on WGS methods for surveillance of bacterial pathogens. Genotyping results are evaluated using a quality assurance module, which identifies problematic samples, such as low-quality or contaminated datasets. Using existing hierarchical SNP schemes for Mycobacterium tuberculosis and Salmonella Typhi, we compare the genotyping results obtained with the k-mer-based tools BioHansel and SKA with those of the organism-specific tools TBProfiler and genotyphi, which use gold-standard reference-mapping approaches. We show that the genotyping results are fully concordant across these different methods, and that the k-mer-based tools are significantly faster. We also test the ability of the BioHansel quality assurance module to detect intra-lineage contamination and demonstrate that it is effective, even in populations with low genetic diversity.
We demonstrate the scalability of the tool using a dataset of ~8100 public S. Typhi genomes and provide the aggregated results of geographical distributions as part of the tool's output. BioHansel is an open-source Python 3 application available from the PyPI and Conda repositories and as a Galaxy tool from the public Galaxy Toolshed. In a public health context, BioHansel enables rapid and high-resolution classification of bacterial pathogens with low genetic diversity.
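
The matching step described in the abstract can be illustrated with a minimal sketch. It is not BioHansel's implementation: the scheme, the 7-mer length, the genotype labels and the function names (`build_automaton`, `find_matches`, `call_genotype`) are all hypothetical, chosen only to show how an Aho-Corasick automaton can find many lineage-defining k-mers in one pass over a sequence and how the hits can be resolved into a hierarchical genotype.

```python
from collections import deque


def build_automaton(patterns):
    """Build a small Aho-Corasick automaton (trie + failure links)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link for each state
    out = [[]]    # patterns recognised on reaching each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first pass: depth-1 states keep failure link 0.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_matches(text, automaton):
    """Return (start, pattern) for every pattern occurrence in text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits


def call_genotype(sequence, scheme):
    """Pick the deepest genotype whose ancestors are all supported too."""
    automaton = build_automaton(scheme)
    found = {scheme[kmer] for _, kmer in find_matches(sequence, automaton)}
    consistent = [
        g for g in sorted(found)
        if all(".".join(g.split(".")[:i]) in found
               for i in range(1, g.count(".") + 1))
    ]
    return max(consistent, key=lambda g: g.count("."), default=None)


# Hypothetical toy scheme: k-mer -> hierarchical genotype label.
scheme = {
    "ACGTACG": "1",
    "TTGCAGG": "2",
    "GGATCCA": "2.1",
    "CCATGTT": "2.2",
}
sequence = "NNN" + "TTGCAGG" + "AAAA" + "GGATCCA" + "TT"
print(call_genotype(sequence, scheme))  # → 2.1
```

The sketch scans only the forward strand and ignores k-mer frequencies in raw reads. It also returns a single label when sibling lineages such as `2.1` and `2.2` are both supported, a conflict that a quality assurance step would instead be expected to flag as possible contamination.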

skalo: using SKA split k-mers with coloured de Brujin graphs to genotype indels

Seamless, rapid and accurate analyses of outbreak genomic data using Split K-mer Analysis (SKA)

Seamless, rapid, and accurate analyses of outbreak genomic data using split k-mer analysis

A Novel Approach to Detect Large Indels from Targeted Sequencing Data in Clinical Cancer Setting

mInDel: a high-throughput and efficient pipeline for genome-wide InDel marker development

Detecting genomic indel variants with exact breakpoints in single- and paired-end sequencing data using SplazerS.

Metagenome SNP calling via read-colored de Bruijn graphs

Defining Loci in Restriction-Based Reduced Representation Genomic Data from Nonmodel Species: Sources of Bias and Diagnostics for Optimal Clustering

PySNV for complex intra-host variation detection

skandiver: a divergence-based analysis tool for identifying intercellular mobile genetic elements

Please Mind the Gap: Indel-Aware Parsimony for Fast and Accurate Ancestral Sequence Reconstruction and Multiple Sequence Alignment including Long Indels

Rapid and accurate SNP genotyping of clonal bacterial pathogens with BioHansel

K-mer analysis of long-read alignment pileups for structural variant genotyping

CLEVER: Clique-Enumerating Variant Finder

A Scalable Tool For Analyzing Genomic Variants Of Humans Using Knowledge Graphs and Machine Learning

diverse-seq: an application for alignment-free selecting and clustering biological sequences

A graph clustering algorithm for detection and genotyping of structural variants from long reads

Ksak: A high-throughput tool for alignment-free phylogenetics

Ingap-Sv: a Novel Scheme to Identify and Visualize Structural Variation from Paired End Mapping Data

Indels: computational methods, evolutionary dynamics, and biological applications