Abstract: The algorithms required for phylogenetics -- multiple sequence alignment and phylogeny estimation -- are both compute intensive. As the size of DNA sequence datasets continues to increase, there is a need for a tool that can effectively lessen the computational burden associated with this widely used analysis. 'diverse-seq' implements computationally efficient alignment-free algorithms that enable rapid prototyping of phylogenetic workflows. It can accelerate parameter selection searches for sequence alignment and phylogeny estimation by identifying a subset of sequences that are representative of the diversity in a collection. We show that selecting representative sequences with an entropy measure of $k$-mer frequencies corresponds well to sampling via conventional genetic distances. The computational cost scales linearly with the number of sequences, and the algorithm can be run in parallel. Applied to a collection of 10.5k whole microbial genomes on a laptop, it took ~8 minutes to prepare the data and 4 minutes to select 100 representatives. 'diverse-seq' can also boost the performance of phylogenetic estimation by providing a seed phylogeny that a more sophisticated algorithm can then refine. For ~1k whole microbial genomes on a laptop, it takes ~1.8 minutes to estimate a bifurcating tree from mash distances. The 'diverse-seq' algorithms are not limited to homologous sequences. As such, they can improve the performance of other workflows. For instance, machine learning projects that involve non-homologous sequences can benefit, as representative sampling can mitigate biases from imbalanced groups. 'diverse-seq' is a BSD-3 licensed Python package that provides both a command-line interface and 'cogent3' plugins. The latter simplify integration by users into their own analyses. It is available via the Python Package Index and GitHub.
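
To make the idea of selecting representatives with an entropy measure of $k$-mer frequencies concrete, the following is a minimal sketch and not the 'diverse-seq' API. It assumes Jensen-Shannon divergence (JSD) as the entropy measure and uses a simple greedy selection seeded with the first sequence; both choices, and the function names `kmer_freqs`, `entropy`, `jsd` and `select_diverse`, are illustrative assumptions rather than details taken from the abstract. Unlike the tool described above, this naive version recomputes the divergence for every candidate and does not scale linearly.

```python
from __future__ import annotations

from collections import Counter
from math import log2


def kmer_freqs(seq: str, k: int) -> dict[str, float]:
    """Relative frequencies of the overlapping k-mers in seq."""
    counts = Counter(seq[i : i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def entropy(freqs: dict[str, float]) -> float:
    """Shannon entropy, in bits, of a frequency distribution."""
    return -sum(p * log2(p) for p in freqs.values() if p > 0)


def jsd(freq_sets: list[dict[str, float]]) -> float:
    """Jensen-Shannon divergence of equally weighted distributions:
    entropy of the mean distribution minus the mean of the entropies."""
    n = len(freq_sets)
    mean: dict[str, float] = {}
    for freqs in freq_sets:
        for kmer, p in freqs.items():
            mean[kmer] = mean.get(kmer, 0.0) + p / n
    return entropy(mean) - sum(entropy(f) for f in freq_sets) / n


def select_diverse(seqs: dict[str, str], k: int, size: int) -> list[str]:
    """Greedy illustration (an assumption, not the published algorithm):
    seed with the first sequence, then repeatedly add the sequence
    that gives the largest JSD for the selected set."""
    freqs = {name: kmer_freqs(seq, k) for name, seq in seqs.items()}
    names = list(freqs)
    chosen = [names[0]]
    while len(chosen) < min(size, len(names)):
        candidates = [name for name in names if name not in chosen]
        best = max(
            candidates,
            key=lambda name: jsd([freqs[c] for c in chosen] + [freqs[name]]),
        )
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    toy = {
        "a": "AAAAAAAA",
        "b": "AAAAAAAA",
        "c": "CCCCCCCC",
        "d": "AAAACCCC",
    }
    print(select_diverse(toy, k=2, size=2))
```

Because the measure is computed from $k$-mer frequencies alone, no alignment is needed and the sequences do not have to be homologous.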

diverse-seq: an application for alignment-free selecting and clustering biological sequences

Ksak: A high-throughput tool for alignment-free phylogenetics

SaAlign: Multiple DNA/RNA Sequence Alignment and Phylogenetic Tree Construction Tool for Ultra-Large Datasets and Ultra-Long Sequences Based on Suffix Array

ProSeq4: A user‐friendly multiplatform program for preparation and analysis of large‐scale DNA polymorphism datasets

PhyloSuite: An integrated and scalable desktop platform for streamlined molecular sequence data management and evolutionary phylogenetics studies

Low-bandwidth and non-compute intensive remote identification of microbes from raw sequencing reads

Viral Phylogenomics Using an Alignment-Free Method: A Three-Step Approach to Determine Optimal Length of K-Mer

Advancing microbial diagnostics: a universal phylogeny guided computational algorithm to find unique sequences for precise microorganism detection

Interpolative Multidimensional Scaling Techniques for the Identification of Clusters in Very Large Sequence Sets

Sample-Align-D: A High Performance Multiple Sequence Alignment System using Phylogenetic Sampling and Domain Decomposition

OrthoPhyl – Streamlining large scale, orthology-based phylogenomic studies of bacteria at broad evolutionary scales

Co-Phylog: an Assembly-Free Phylogenomic Approach for Closely Related Organisms

Analysis of Phylogeny Tracking Algorithms for Serial and Multiprocess Applications

PyamilySeq: A Python Tool for Interpretable Gene (Re)Clustering and Pangenomic Inference Across Species and Genera

High-resolution microbial community reconstruction by integrating short reads from multiple 16S rRNA regions

MetaSMC: a Coalescent-Based Shotgun Sequence Simulator for Evolving Microbial Populations

CleanBSequences: an efficient curator of biological sequences in R

PhyloAln: a convenient reference-based tool to align sequences and high-throughput reads for phylogeny and evolution in the omic era

CD-HIT: accelerated for clustering the next-generation sequencing data

VEHoP: A Versatile, Easy-to-use, and Homology-based Phylogenomic pipeline accommodating diverse sequences

Real-time Taxonomic Characterization of Long-read Mixed-species Sequencing Samples in Sorted Motif Distance Space: