RabbitTClust: Enabling Fast Clustering Analysis of Millions of Bacteria Genomes with MinHash Sketches.

Xiaoming Xu, Zekun Yin, Lifeng Yan, Hao Zhang, Borui Xu, Yanjie Wei, Beifang Niu, Bertil Schmidt, Weiguo Liu
DOI: https://doi.org/10.1186/s13059-023-02961-6
IF: 17.906
2023-01-01
Genome Biology
Abstract: We present RabbitTClust, a fast and memory-efficient genome clustering tool based on sketch-based distance estimation. Our approach enables efficient processing of large-scale datasets by combining dimensionality reduction techniques with streaming and parallelization on modern multi-core platforms. On a 128-core workstation, 113,674 complete bacterial genome sequences from RefSeq (455 GB in FASTA format) can be clustered in less than 6 min, and 1,009,738 GenBank assembled bacterial genomes (4.0 TB in FASTA format) in only 34 min. Our results further identify 1,269 redundant genomes, with identical nucleotide content, in the RefSeq bacterial genomes database.
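To make "sketch-based distance estimation" concrete, the following is a minimal Python sketch of bottom-k MinHash on k-mers. It is an illustration only, not RabbitTClust's implementation: the function names, the BLAKE2b hash, the k-mer length, and the sketch size are all assumptions chosen for the example, and the distance formula is the one popularized by the Mash tool, which the abstract does not itself specify.

```python
import hashlib
import math


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=21, size=1000):
    """Bottom-k MinHash sketch: the `size` smallest distinct k-mer hashes."""
    hashes = {hash_kmer(x) for x in kmers(seq, k)}
    return sorted(hashes)[:size]


def jaccard_estimate(a, b, size):
    """Estimate Jaccard similarity from two bottom-k sketches.

    Take the `size` smallest hashes of the merged sketches and count
    the fraction that occur in both.
    """
    sa, sb = set(a), set(b)
    merged = sorted(sa | sb)[:size]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in sa and h in sb)
    return shared / len(merged)


def mash_distance(j, k):
    """Mash-style distance D = -ln(2j / (1 + j)) / k, clamped to [0, 1]."""
    if j <= 0.0:
        return 1.0
    return max(0.0, min(1.0, -math.log(2.0 * j / (1.0 + j)) / k))
```

Under these assumptions, each genome is reduced to a fixed-size list of integers, so comparing two genomes costs time proportional to the sketch size rather than the genome length. That is the dimensionality reduction that makes pairwise comparison feasible at the scale the abstract reports.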