Abstract: Hierarchical genotyping approaches can provide insights into the source, geography and temporal distribution of bacterial pathogens. Multiple hierarchical SNP genotyping schemes have previously been developed so that new isolates can rapidly be placed within pre-computed population structures, without the need to rebuild phylogenetic trees for the entire dataset. This classification approach has, however, seen limited uptake in routine public health settings because of its analytical complexity and the lack of standardized tools that make results clear and easy to interpret. BioHansel was developed as an organism-agnostic tool for hierarchical SNP-based genotyping. The tool identifies split k-mers that distinguish predefined lineages in whole genome sequencing (WGS) data using SNP-based genotyping schemes. BioHansel uses the Aho-Corasick algorithm to type isolates from assembled genomes or raw read sequence data in a matter of seconds, with limited computational resources. This makes BioHansel ideal for use by public health agencies that rely on WGS methods for surveillance of bacterial pathogens. Genotyping results are evaluated using a quality assurance module which identifies problematic samples, such as low-quality or contaminated datasets. Using existing hierarchical SNP schemes for Mycobacterium tuberculosis and Salmonella Typhi, we compare the genotyping results obtained with the k-mer-based tools BioHansel and SKA with those of the organism-specific tools TBProfiler and genotyphi, which use gold-standard reference-mapping approaches. We show that the genotyping results are fully concordant across these different methods, and that the k-mer-based tools are significantly faster. We also test the ability of the BioHansel quality assurance module to detect intra-lineage contamination and demonstrate that it is effective, even in populations with low genetic diversity. We demonstrate the scalability of the tool using a dataset of ~8100 public S. Typhi genomes and provide the aggregated results of geographical distributions as part of the tool's output. BioHansel is an open-source Python 3 application available on PyPI and Conda repositories and as a Galaxy tool from the public Galaxy Toolshed. In a public health context, BioHansel enables rapid and high-resolution classification of bacterial pathogens with low genetic diversity.
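The general idea of typing an isolate by searching its sequence for lineage-specific k-mers with the Aho-Corasick algorithm can be illustrated with a minimal sketch. This is not BioHansel's code or its scheme format: the function names, the toy 9-mers and the genotype labels are all hypothetical, chosen only for illustration, and the rule for picking the final genotype (the deepest label whose ancestors were all matched) is an assumption. The sketch also omits the quality assurance checks described above.

```python
from collections import deque


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def build_automaton(patterns):
    """Build an Aho-Corasick automaton from a dict of k-mer -> label."""
    goto = [{}]   # goto[state][char] -> next state (the trie)
    fail = [0]    # failure link for each state
    out = [[]]    # labels reported when a state is reached
    for kmer, label in patterns.items():
        state = 0
        for ch in kmer:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(label)
    # Breadth-first pass to set failure links (depth-1 states fail to root).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_labels(sequence, automaton):
    """Scan the sequence once and return the set of matched labels."""
    goto, fail, out = automaton
    state, hits = 0, set()
    for ch in sequence:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        hits.update(out[state])
    return hits


def ancestors(genotype):
    """'1.2.1' -> ['1', '1.2'] (hypothetical dotted hierarchy)."""
    parts = genotype.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts))]


def call_genotype(sequence, scheme):
    """Return the deepest matched genotype whose ancestors all matched."""
    patterns = {}
    for kmer, genotype in scheme.items():
        patterns[kmer] = genotype
        patterns[reverse_complement(kmer)] = genotype  # search both strands
    hits = find_labels(sequence.upper(), build_automaton(patterns))
    supported = [g for g in hits if all(a in hits for a in ancestors(g))]
    return max(supported, key=lambda g: g.count("."), default=None)


# Toy scheme: made-up 9-mers, each standing in for a lineage-specific k-mer.
scheme = {
    "ACGTAGCTA": "1",
    "TTGACCGTA": "1.2",
    "GGCATTCAG": "1.2.1",
    "CCATGAGTT": "2",
}
genome = "AAAC" + "ACGTAGCTA" + "GGG" + "TTGACCGTA" + "TTTT"
print(call_genotype(genome, scheme))
```

Because the automaton is built once from the scheme and the sequence is scanned in a single pass, the search cost grows with the length of the input rather than with the number of k-mers in the scheme. In this sketch, two unrelated lineages matching at the same depth would be resolved arbitrarily; a real tool would need to flag such a sample as potentially contaminated.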

Rapid and accurate SNP genotyping of clonal bacterial pathogens with BioHansel
