Prokrustean Graph: A substring index for rapid k-mer size analysis

Adam Park, David Koslicki
DOI: https://doi.org/10.1101/2023.11.21.568151
2024-09-15
Abstract: Despite the widespread adoption of k-mer-based methods in bioinformatics, understanding the influence of k-mer sizes remains a persistent challenge. Selecting an optimal k-mer size or employing multiple k-mer sizes is often arbitrary, application-specific, and fraught with computational complexities. Typically, the influence of k-mer size is obscured by the outputs of complex bioinformatics tasks, such as genome analysis, comparison, assembly, alignment, and error correction. However, it is frequently overlooked that every method is built above a well-defined k-mer-based object like Jaccard Similarity, de Bruijn graphs, k-mer spectra, and Bray-Curtis Dissimilarity. Despite these objects offering a clearer perspective on the role of k-mer sizes, the dynamics of k-mer-based objects with respect to k-mer sizes remain surprisingly elusive. This paper introduces a computational framework that generalizes the transition of k-mer-based objects across k-mer sizes, utilizing a novel substring index, the Prokrustean graph. The primary contribution of this framework is to compute quantities associated with k-mer-based objects for all k-mer sizes, where the computational complexity depends solely on the number of maximal repeats and is independent of the range of k-mer sizes. For example, counting vertices of compacted de Bruijn graphs for k = 1, …, 100 can be accomplished in mere seconds with our substring index constructed on a gigabase-sized read set. Additionally, we derive a space-efficient algorithm to extract the Prokrustean graph from the Burrows-Wheeler Transform. It becomes evident that modern substring indices, mostly based on longest common prefixes of suffix arrays, inherently face difficulties at exploring varying k-mer sizes due to their limitations at grouping co-occurring substrings. We have implemented four applications that utilize quantities critical in modern pangenomics and metagenomics.
The code for these applications and the construction algorithm is available at .
Genomics
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper addresses two key challenges of k-mer-based methods in bioinformatics: the arbitrariness of k-mer size selection and the computational cost of working with many k-mer sizes.

#### Main Issues

1. **k-mer Size Selection Problem**: Although k-mer size significantly impacts the results, in practice the choice of k-mer size is often arbitrary and lacks formal guidance.
2. **Computational Burden of Multiple k-mer Sizes**: While using multiple k-mer sizes can improve accuracy, the computational cost increases sharply with each additional k-mer size.

#### Solutions

The paper proposes a new computational framework, built on the Prokrustean graph, that generalizes the transition of k-mer-based objects across k-mer sizes. Specifically:

- **Prokrustean Graph**: A novel substring index that can efficiently compute various k-mer-related quantities, such as the number of unique k-mers for each k-mer size, Jaccard similarity, etc.
- **Algorithm Efficiency**: The time complexity of the framework depends only on the number of maximal repeats and is independent of the range of k-mer sizes, significantly reducing computational costs.
- **Implementation Applications**: The authors implemented four applications that use the Prokrustean graph to compute quantities critical in modern pangenomics and metagenomics.

With this framework, researchers can explore k-mer-based objects across k-mer sizes more systematically, and so better understand the impact of k-mer size on bioinformatics tasks.
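
To make the "computational burden" point concrete, here is a minimal Python sketch of the naive baseline that the framework aims to avoid. It is not the Prokrustean graph and not the authors' code: it simply rescans the input once per k-mer size, so its cost grows with the width of the k range. The function names (`kmers`, `distinct_kmer_counts`, `jaccard`) are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, Set


def kmers(sequences: Iterable[str], k: int) -> Set[str]:
    """Return the set of distinct k-mers over all sequences (naive scan)."""
    result: Set[str] = set()
    for seq in sequences:
        # range() is empty when the sequence is shorter than k.
        for i in range(len(seq) - k + 1):
            result.add(seq[i:i + k])
    return result


def distinct_kmer_counts(sequences: Iterable[str], k_min: int, k_max: int) -> Dict[int, int]:
    """Count distinct k-mers for every k in [k_min, k_max].

    Baseline only: the whole input is rescanned for each k, which is
    exactly the per-k cost described in the text above.
    """
    seqs = list(sequences)
    return {k: len(kmers(seqs, k)) for k in range(k_min, k_max + 1)}


def jaccard(a: Iterable[str], b: Iterable[str], k: int) -> float:
    """Jaccard similarity between the k-mer sets of two sequence collections."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    # Convention assumed here: two empty k-mer sets have similarity 0.0.
    return len(ka & kb) / len(union) if union else 0.0


if __name__ == "__main__":
    reads = ["ACGTACGT"]
    print(distinct_kmer_counts(reads, 1, 9))
    for k in (1, 2, 4):
        print(k, jaccard(["ACGT"], ["CGTA"], k))
```

Even on this toy input the Jaccard value moves from 1.0 at k = 1 to 0.0 at k = 4, which is the kind of k-dependence the paper proposes to study for all k-mer sizes at once rather than one size at a time.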