Genomic Distance-based Rapid Uncovering of Microbial Population Structures (GRUMPS): a reference free genomic data cleaning methodology

Kaleb Z. Abram, Zulema Udaondo, Michael S. Robeson, Se-Ran Jun
DOI: https://doi.org/10.1101/2022.12.19.521123
2024-09-09
Abstract: Accurate datasets are crucial for rigorous large-scale sequence-based analyses such as those performed in phylogenomics and pangenomics. As the volume of available sequence data grows and the quality of these sequences varies, there is a pressing need for reliable methods to swiftly identify and eliminate low-quality and misidentified genomes from datasets prior to analysis. Here we introduce a robust, controlled, computationally efficient method for deriving species-level population structures of bacterial species, regardless of the dataset size. Additionally, our pipeline can classify genomes into their respective species at the genus level. By leveraging this methodology, researchers can rapidly clean datasets encompassing entire bacterial species and examine the sub-species population structures within the provided genomes. These cleaned datasets can subsequently undergo further refinement using a variety of methods to yield sequence sets with varying levels of diversity that faithfully represent entire species. Increasing the efficiency and accuracy of curation of species-level datasets not only enhances the reliability of downstream analyses, but also facilitates a deeper understanding of bacterial population dynamics and evolution.
Bioinformatics
What problem does this paper attempt to address?
The paper addresses the problem of quickly identifying and removing low-quality or mislabeled genome sequences before large-scale genomic analysis, so that the resulting datasets are more accurate and consistent. As sequencing technology has developed, the amount of available bacterial genome sequence data has grown rapidly, but these datasets may contain sampling biases and sequences of uneven quality. Such problems reduce the accuracy and reliability of downstream analyses, especially comparative genomics, pangenome studies, taxonomy, and core gene multilocus sequence typing analyses that rely on gene presence thresholds. To address these issues, the authors developed GRUMPS (Genomic Distance-based Rapid Uncovering of Microbial Population Structures). GRUMPS is a Python program that automatically detects and excludes outlier genomes from a dataset using statistical analysis and unsupervised machine learning algorithms, enabling rapid and reproducible cleaning of bacterial datasets of any size. Additionally, GRUMPS can classify the genomes of a genus-level dataset into their respective species, and the cleaned species-level datasets can be refined further, aiding a better understanding of bacterial population dynamics and evolution.
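To make the idea of distance-based outlier removal concrete, the following is a minimal sketch and not GRUMPS's actual implementation or interface. It assumes a symmetric matrix of pairwise genomic distances is already available and flags genomes whose mean distance to all other genomes is unusually high. The function name `flag_outlier_genomes`, the z-score rule, and the cutoff of 2.0 are all hypothetical choices made for illustration.

```python
from statistics import mean, stdev


def flag_outlier_genomes(names, dist, z_cutoff=2.0):
    """Return the genomes whose mean distance to the rest is unusually high.

    names    -- list of genome identifiers
    dist     -- symmetric matrix (list of lists) of pairwise distances
    z_cutoff -- hypothetical threshold on the z-score of the mean distance
    """
    n = len(names)
    # Mean distance from each genome to every other genome (skip the diagonal).
    mean_dist = [
        sum(dist[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    center = mean(mean_dist)
    spread = stdev(mean_dist)
    if spread == 0:
        # All genomes are equally distant from each other: nothing to flag.
        return []
    return [
        name
        for name, d in zip(names, mean_dist)
        if (d - center) / spread > z_cutoff
    ]


# Toy dataset (made-up values): seven closely related genomes and one
# genome that is far from all of them.
names = [f"genome_{i}" for i in range(7)] + ["contaminant"]
last = len(names) - 1
dist = [
    [0.0 if i == j else (0.2 if last in (i, j) else 0.01)
     for j in range(len(names))]
    for i in range(len(names))
]

flagged = flag_outlier_genomes(names, dist)
cleaned = [name for name in names if name not in flagged]
```

A single global cutoff like this only illustrates the statistical side of the description. Separating several species or sub-species groups would need a clustering step, which this sketch deliberately leaves out.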