Seamless, rapid and accurate analyses of outbreak genomic data using Split K-mer Analysis (SKA)

Romain Derelle, Johanna von Wachsmann, Tommi Mäklin, Joel Hellewell, Timothy Russell, Ajit Lalvani, Leonid Chindelevitch, Nicholas J. Croucher, Simon R. Harris, John A. Lees
DOI: https://doi.org/10.1101/2024.03.25.586631
2024-03-29
Abstract: Sequence variation observed in populations of pathogens can be used for important public health and evolutionary genomic analyses, especially outbreak analysis and transmission reconstruction. Identifying this variation is typically achieved by aligning sequence reads to a reference genome, but this approach is susceptible to reference biases and requires careful filtering of called genotypes. Additionally, while the volume of bacterial genomes continues to grow, tools which can accurately and quickly call genetic variation between sequences have not kept pace. There is a need for tools which can process this large volume of data, providing rapid results, but remain simple so they can be used without highly trained bioinformaticians, expensive data analysis, and long term storage and processing of large files. Here we describe Split K-mer Analysis (SKA2), a method which supports both reference-free and reference-based mapping to quickly and accurately genotype populations of bacteria using sequencing reads or genome assemblies. SKA2 is highly accurate for closely related samples, and in outbreak simulations we show superior variant recall compared to reference-based methods, with no false positives. We also show that within bacterial strains, where it is possible to construct a clonal frame, SKA2 can also accurately map variants to a reference, and be used with recombination detection methods to rapidly reconstruct vertical evolutionary history. SKA2 is many times faster than comparable methods and can be used to add new genomes to an existing call set, allowing sequential use without the need to reanalyse entire collections. Given its robust implementation, inherent absence of reference bias and high accuracy, SKA2 has the potential to become the tool of choice for genotyping bacteria and can help expand the uses of genome data in evolutionary and epidemiological analyses. SKA2 is implemented in Rust and is freely available at .
Bioinformatics
What problem does this paper attempt to address?
This paper addresses how to identify sequence variation in pathogen genome data quickly and accurately, particularly for outbreak analysis and transmission reconstruction. Conventional approaches align sequence reads to a reference genome, which introduces reference bias and requires careful filtering of the called genotypes. As the volume of bacterial genome data grows, existing variant-calling tools have not kept pace, and they tend to demand trained bioinformaticians, expensive analysis, and long-term storage and processing of large files. The authors therefore present Split K-mer Analysis (SKA2), which supports both reference-free and reference-based mapping and genotypes bacterial populations from sequencing reads or genome assemblies, with the aim of remaining simple enough for users without specialist bioinformatics training. SKA2 compares samples directly using split k-mers, that is, k-mers in which the flanking bases must match while the middle base is allowed to vary. In its reference-free mode this sidesteps read alignment and conventional variant calling, which removes reference bias and speeds up processing. According to the abstract, the method is most accurate for closely related samples, such as those within an outbreak or a single bacterial strain, and it can add new genomes to an existing call set without reanalysing the whole collection, which supports rapid outbreak response and public health decision-making.
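The split k-mer idea described above can be illustrated with a small sketch. This is a toy under stated assumptions, not SKA2's implementation: the function names `split_kmers` and `variant_sites` are hypothetical, the window length of 7 is arbitrary, reverse complements are ignored, and flank pairs seen with more than one middle base in the same sample are simply dropped.

```python
def split_kmers(seq: str, k: int = 7) -> dict:
    """Map each (left flank, right flank) pair to its middle base.

    Toy illustration only: no reverse complements, and flank pairs that
    occur with conflicting middle bases in one sequence are discarded.
    """
    if k < 3 or k % 2 == 0:
        raise ValueError("k must be odd and at least 3")
    half = k // 2
    table = {}
    conflicting = set()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        key = (window[:half], window[half + 1:])
        middle = window[half]
        if key in table and table[key] != middle:
            conflicting.add(key)
        table[key] = middle
    for key in conflicting:
        del table[key]
    return table


def variant_sites(sample_a: str, sample_b: str, k: int = 7) -> list:
    """Return (left flank, base in A, base in B, right flank) for every
    flank pair shared by both samples whose middle base differs."""
    a = split_kmers(sample_a, k)
    b = split_kmers(sample_b, k)
    shared = a.keys() & b.keys()
    return sorted(
        (left, a[(left, right)], b[(left, right)], right)
        for (left, right) in shared
        if a[(left, right)] != b[(left, right)]
    )


if __name__ == "__main__":
    # Two made-up sequences differing by a single base (C -> A).
    print(variant_sites("ACGTTGCATGGA", "ACGTTGAATGGA"))
    # [('TTG', 'C', 'A', 'ATG')]
```

Because the two samples are compared only with each other, no reference genome enters the calculation, which is the sense in which the reference-free mode avoids reference bias. The sketch also hints at why the approach favours closely related samples: a variant is only detected when both flanks match exactly, so two nearby differences would hide each other.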