diverse-seq: an application for alignment-free selecting and clustering biological sequences

Gavin A Huttley, Katherine Caley, Robert Neil McArthur
DOI: https://doi.org/10.1101/2024.11.10.622877
2024-11-11
Abstract: The algorithms required for phylogenetics (multiple sequence alignment and phylogeny estimation) are both compute-intensive. As the size of DNA sequence datasets continues to increase, there is a need for a tool that can effectively lessen the computational burden associated with this widely used analysis. 'diverse-seq' implements computationally efficient alignment-free algorithms that enable rapid prototyping of phylogenetic workflows. It can accelerate parameter selection searches for sequence alignment and phylogeny estimation by identifying a subset of sequences that are representative of the diversity in a collection. We show that selecting representative sequences with an entropy measure of $k$-mer frequencies corresponds well to sampling via conventional genetic distances. The computational performance is linear with respect to the number of sequences, and the algorithms can be run in parallel. Applied to a collection of 10.5k whole microbial genomes on a laptop, 'diverse-seq' took ~8 minutes to prepare the data and 4 minutes to select 100 representatives. 'diverse-seq' can further boost the performance of phylogenetic estimation by providing a seed phylogeny that can then be refined by a more sophisticated algorithm. For ~1k whole microbial genomes on a laptop, it takes ~1.8 minutes to estimate a bifurcating tree from mash distances. The 'diverse-seq' algorithms are not limited to homologous sequences. As such, they can improve the performance of other workflows. For instance, machine learning projects that involve non-homologous sequences can benefit, as representative sampling can mitigate biases from imbalanced groups. 'diverse-seq' is a BSD-3 licensed Python package that provides both a command-line interface and 'cogent3' plugins. The latter simplify integration by users into their own analyses. It is available via the Python Package Index and GitHub.
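To make the idea of selecting representatives with an entropy measure of $k$-mer frequencies concrete, here is a minimal sketch in plain Python. It is not the 'diverse-seq' implementation or its API. It assumes the entropy measure is the Jensen-Shannon divergence (entropy of the mean $k$-mer distribution minus the mean of the individual entropies) and uses a simple greedy strategy; the function names `kmer_freqs`, `entropy`, `jsd` and `select_diverse` are hypothetical and chosen for illustration only.

```python
from collections import Counter
from math import log2


def kmer_freqs(seq, k=2):
    """Relative frequencies of the overlapping k-mers in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def entropy(freqs):
    """Shannon entropy (in bits) of a frequency distribution."""
    return -sum(p * log2(p) for p in freqs.values() if p > 0)


def jsd(freq_list):
    """Jensen-Shannon divergence of several k-mer distributions:
    entropy of the mean distribution minus the mean of the entropies."""
    n = len(freq_list)
    mean = Counter()
    for freqs in freq_list:
        for kmer, p in freqs.items():
            mean[kmer] += p / n
    return entropy(mean) - sum(entropy(f) for f in freq_list) / n


def select_diverse(seqs, n, k=2):
    """Greedily choose n names whose k-mer profiles have high joint JSD.

    Assumption: a naive greedy search seeded with the first name in
    sorted order; illustrative only, not the published algorithm.
    """
    freqs = {name: kmer_freqs(seq, k) for name, seq in seqs.items()}
    remaining = sorted(freqs)
    chosen = [remaining.pop(0)]
    while len(chosen) < n and remaining:
        # add the candidate that maximises the JSD of the enlarged set
        best = max(
            remaining,
            key=lambda name: jsd([freqs[c] for c in chosen + [name]]),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen


toy = {
    "a": "AAAAAAAA",
    "b": "AAAAAAAT",
    "c": "CGCGCGCG",
    "d": "AAAAAAAA",
}
picked = select_diverse(toy, 2)
print(picked)
```

In this toy collection, "c" shares no 2-mers with "a", so it adds the most divergence and is preferred over the near-duplicate "b" and the exact duplicate "d". This naive search recomputes the divergence for every candidate at every step, so it makes no attempt to reproduce the linear scaling reported above.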
Bioinformatics