Seamless, rapid and accurate analyses of outbreak genomic data using Split K-mer Analysis (SKA)

Romain Derelle, Johanna von Wachsmann, Tommi Mäklin, Joel Hellewell, Timothy Russell, Ajit Lalvani, Leonid Chindelevitch, Nicholas J. Croucher, Simon R. Harris, John A. Lees

DOI: https://doi.org/10.1101/2024.03.25.586631

2024-03-29

Abstract: Sequence variation observed in populations of pathogens can be used for important public health and evolutionary genomic analyses, especially outbreak analysis and transmission reconstruction. Identifying this variation is typically achieved by aligning sequence reads to a reference genome, but this approach is susceptible to reference biases and requires careful filtering of called genotypes. Additionally, while the volume of bacterial genomes continues to grow, tools which can accurately and quickly call genetic variation between sequences have not kept pace. There is a need for tools which can process this large volume of data, providing rapid results, but remain simple so they can be used without highly trained bioinformaticians, expensive data analysis, and long-term storage and processing of large files. Here we describe Split K-mer Analysis (SKA2), a method which supports both reference-free and reference-based mapping to quickly and accurately genotype populations of bacteria using sequencing reads or genome assemblies. SKA2 is highly accurate for closely related samples, and in outbreak simulations we show superior variant recall compared to reference-based methods, with no false positives. We also show that within bacterial strains, where it is possible to construct a clonal frame, SKA2 can also accurately map variants to a reference, and be used with recombination detection methods to rapidly reconstruct vertical evolutionary history. SKA2 is many times faster than comparable methods and can be used to add new genomes to an existing call set, allowing sequential use without the need to reanalyse entire collections. Given its robust implementation, inherent absence of reference bias and high accuracy, SKA2 has the potential to become the tool of choice for genotyping bacteria and can help expand the uses of genome data in evolutionary and epidemiological analyses. SKA2 is implemented in Rust and is freely available at .

Bioinformatics

What problem does this paper attempt to address?

This paper addresses how to identify sequence variation in pathogen genome data quickly and accurately, particularly for outbreak analysis and transmission reconstruction. Conventional methods align sequence reads to a reference genome, which introduces reference bias and requires careful filtering of the called genotypes. In addition, as the volume of bacterial genome data grows, existing variant-calling tools have not kept pace: they are slow on large collections and demand specialist bioinformatics expertise, expensive data analysis, and long-term storage and processing of large files. The authors therefore developed Split K-mer Analysis (SKA2), a tool that supports both reference-free and reference-based mapping and genotypes bacterial populations from sequencing reads or genome assemblies, so that users without specialist bioinformatics training can process large genome collections. SKA2 compares samples directly using split k-mers, that is, pairs of flanking k-mers separated by a middle base that is allowed to vary. In reference-free mode this avoids the read mapping and variant calling steps of conventional pipelines, which removes reference bias and speeds up processing. The method is most accurate for closely related samples, such as isolates from a single outbreak or bacterial strain. In outbreak simulations it showed higher variant recall than reference-based methods, with no false positives. New genomes can also be added to an existing call set without reanalysing the whole collection, which supports a rapid response to outbreaks and helps inform public health interventions.
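The split k-mer idea summarised above can be illustrated with a minimal Python sketch. This is not SKA2's implementation (SKA2 is written in Rust). The function names `split_kmers` and `snp_differences`, the flank length of 3, and the example sequences are all hypothetical choices made for illustration. The sketch also makes simplifying assumptions: it ignores reverse complements, and it simply discards any flank pair seen with more than one middle base within a sample.

```python
def split_kmers(seq, flank=3):
    """Map each (left flank, right flank) pair to the middle base between them.

    Assumption for this sketch: a flank pair observed with conflicting
    middle bases in the same sequence is dropped as ambiguous.
    """
    width = 2 * flank + 1
    table = {}
    ambiguous = set()
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        key = (window[:flank], window[flank + 1:])
        middle = window[flank]
        if key in table and table[key] != middle:
            ambiguous.add(key)
        table[key] = middle
    for key in ambiguous:
        del table[key]
    return table


def snp_differences(sample_a, sample_b, flank=3):
    """Report flank pairs shared by both samples whose middle bases differ."""
    a = split_kmers(sample_a, flank)
    b = split_kmers(sample_b, flank)
    shared = a.keys() & b.keys()
    return sorted(
        (left, a[(left, right)], b[(left, right)], right)
        for (left, right) in shared
        if a[(left, right)] != b[(left, right)]
    )


# Two hypothetical sequences that differ by a single base (G -> C).
seq_a = "ACGTACGGATTCAGC"
seq_b = "ACGTACGCATTCAGC"
print(snp_differences(seq_a, seq_b))  # [('ACG', 'G', 'C', 'ATT')]
```

No reference genome is involved here: the variant is found because both samples share the same flanking sequence and disagree only on the base between the flanks. This also suggests why the approach suits closely related samples, since a second nearby difference would change the flanks and the pair would no longer match.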

Seamless, rapid, and accurate analyses of outbreak genomic data using split k-mer analysis

skalo: using SKA split k-mers with coloured de Brujin graphs to genotype indels

Rapid and accurate SNP genotyping of clonal bacterial pathogens with BioHansel

Analytical Performance Validation of Next-Generation Sequencing Based Clinical Microbiology Assays Using a K-mer Analysis Workflow

Rapid SARS-CoV-2 surveillance using clinical, pooled, or wastewater sequence as a sensor for population change

Reference-free Structural Variant Detection in Microbiomes via Long-read Coassembly Graphs

Reference-free structural variant detection in microbiomes via long-read co-assembly graphs

BreaKmer: detection of structural variation in targeted massively parallel sequencing data using kmers

Clinical PathoScope: rapid alignment and filtration for accurate pathogen identification in clinical samples using unassembled sequencing data

Accurate bacterial outbreak tracing with Oxford Nanopore sequencing and reduction of methylation-induced errors

A genomic surveillance framework and genotyping tool for Klebsiella pneumoniae and its related species complex

diverse-seq: an application for alignment-free selecting and clustering biological sequences

KPop: Accurate and scalable comparative analysis of microbial genomes by sequence embeddings

Ksak: A high-throughput tool for alignment-free phylogenetics

Multiple genome analytics framework: The case of all SARS-CoV-2 complete variants

KMCP: accurate metagenomic profiling of both prokaryotic and viral populations by pseudo-mapping

refMLST: reference-based multilocus sequence typing enables universal bacterial typing

MetaSMC: a Coalescent-Based Shotgun Sequence Simulator for Evolving Microbial Populations

A program for real-time surveillance of SARS-CoV-2 genetics

Augur: a bioinformatics toolkit for phylogenetic analyses of human pathogens