K2R: Tinted de Bruijn Graphs implementation for efficient read extraction from sequencing datasets
Lea Vandamme, Bastien Cazaux, Antoine Limasset
DOI: https://doi.org/10.1101/2024.02.15.580442
2024-12-11
Abstract: The analysis of biological sequences often depends on reference genomes; however, achieving accurate assemblies remains a significant challenge. As a result, de novo analysis directly from raw sequencing reads, without pre-processing, is frequently a more practical approach. A common need across various applications is the ability to identify reads containing a specific k-mer within a dataset. This k-mer-to-read association is critical in multiple contexts, such as genotyping, bacterial strain resolution, profiling, data compression, error correction, and assembly. While this challenge appears similar to the extensively researched colored de Bruijn graph problem, resolving it at the read level is prohibitively resource-intensive for practical applications. In this work, we demonstrate that it can be solved tractably by leveraging reasonable assumptions for genome sequencing dataset indexing. To tackle this challenge, we introduce the Tinted de Bruijn Graph concept, an altered version of the colored de Bruijn graph where each read in a sequencing dataset acts as a distinct source. We developed K2R, a highly scalable index that implements this framework efficiently. K2R's performance, in terms of index size, memory footprint, throughput, and construction time, is benchmarked against leading methods, including hashing techniques (e.g., Short Read Connector and Fulgor) and full-text indexing (e.g., Movi and Themisto), across various datasets. To demonstrate K2R's scalability, we indexed two human datasets from the T2T consortium. The 126X coverage ONT dataset was indexed in 9 hours using 61 GB of RAM, resulting in a 30 GB index. Similarly, the 56X coverage HiFi dataset was indexed in less than 5 hours using 39 GB of RAM, producing a 20.5 GB index. Developed in C++, the K2R index is open-source and available on GitHub at http://github.com/LeaVandamme/K2R.
Bioinformatics