Implementation Of Dynamic De Bruijn Graphs Via Learned Index

Riccardo Nigrelli
2024-06-18
Abstract: De Bruijn graphs are essential for sequencing data analysis and must be efficiently constructed and stored for large-scale population studies. They also need to be dynamic to allow updates such as adding or removing edges and nodes. Existing dynamic implementations include DynamicBOSS and dynamicDBG. In 2018, a new family of data structures called learned indexes was introduced by Tim Kraska and Alex Beutel, with a particularly efficient implementation proposed by Paolo Ferragina and Giorgio Vinciguerra in 2020. This paper presents a new method for implementing De Bruijn graphs using learned indexes and compares its performance with current implementations. The new method shows improved time and memory efficiency for edge and node insertions, particularly with large datasets (over 110 million k-mers).
Data Structures and Algorithms
What problem does this paper attempt to address?
This paper addresses the efficient construction and storage of de Bruijn graphs for large-scale genomic sequencing data. Specifically, it proposes a new method based on learned indexes to implement dynamic de Bruijn graphs and compares its performance with existing dynamic implementations (such as DynamicBOSS and dynamicDBG). The main goal is to improve the time and memory efficiency of insertion and deletion operations on large datasets (more than 110 million k-mers).

### Background
- **Genome Assembly**: With the development of high-throughput sequencing technology, genome assembly has become a major computational challenge in molecular biology and is recognized as an NP-hard problem.
- **Traditional Methods**: Traditional long-read assembly algorithms use overlap graphs, but this approach runs into computational limitations and problems with short reads on large-scale datasets.
- **de Bruijn Graphs**: Many recent algorithms have turned to de Bruijn graphs, where each node represents a k-mer (a substring of length k) and edges represent exact overlaps of length k-1. Although de Bruijn graphs are more efficient to construct than overlap graphs, they still require a large amount of memory, causing the overlap phase to become a bottleneck.
- **Learned Data Structures**: Combining recent advances in data structures and machine learning has produced learned data structures, which exploit patterns in the data to improve space efficiency and time performance.

### Methods
- **PGM-Index**: The paper uses the PGM-Index, an efficient dynamic indexing method, as the basis for the learned index.
- **Improvements**:
  - Support indexing of single-element vectors, not just key-value pairs.
  - Look for memory-efficient deletion methods.
  - Implement procedures to remove duplicate and deleted elements.
  - Use the KMC library for online index construction to optimize memory usage.

### Evaluation
- **Datasets**: The E. coli K-12 substr. MG1655 dataset is used for testing, with four subsets containing 20,000, 200,000, 2,000,000 and 14,000,000 reads respectively.
- **Performance Analysis**:
  - **Creation**: The set-based implementation is more memory-efficient than the key-value-pair-based one, with comparable time performance. DynamicBOSS has poor time performance on all datasets but the best memory usage.
  - **Insertion**: DynamicBOSS cannot be tested on the larger datasets because its execution time is too long. The PGM-Index-based implementations show reasonable time performance.
  - **Deletion**: The performance trend for deletion is similar to that for insertion. DynamicBOSS cannot complete the test on the two largest datasets.
  - **Search**: DynamicBOSS is not competitive in either time or memory. The PGM-Index-based implementations show comparable time performance, and the single-element method uses less memory.

### Conclusions and Future Work
- **Current Results**: The proposed data structure based on the dynamic PGM-Index Set is superior to existing methods in time and memory efficiency for modification operations and searches.
- **Future Development Directions**:
  - Extend the k-mer representation to allow k values up to 255.
  - Implement batch insertion and deletion of k-mers.
  - Physically perform element modification and deletion operations.
  - Explore the feasibility of using the dynamic PGM-Index to represent colored de Bruijn graphs.
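
To make the graph definition from the Background section concrete (nodes are k-mers, edges are exact overlaps of length k-1), here is a minimal Python sketch of a de Bruijn graph stored as a sorted set of integer-encoded k-mers. It is an illustration under stated assumptions, not the paper's implementation: the 2-bit encoding, the class name `SortedKmerGraph`, and all helper names are hypothetical, and a plain sorted list with binary search stands in for the learned index.

```python
from bisect import bisect_left

# Assumed 2-bit encoding of nucleotides (hypothetical, for illustration only).
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODE = "ACGT"


def encode(kmer):
    """Pack a k-mer into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value


def decode(value, k):
    """Unpack an integer produced by encode() back into a k-mer string."""
    bases = []
    for _ in range(k):
        bases.append(DECODE[value & 3])
        value >>= 2
    return "".join(reversed(bases))


class SortedKmerGraph:
    """De Bruijn graph kept as a sorted list of distinct encoded k-mers."""

    def __init__(self, reads, k):
        self.k = k
        self.mask = (1 << (2 * k)) - 1  # keeps only the lowest 2k bits
        codes = set()
        for read in reads:
            for i in range(len(read) - k + 1):
                codes.add(encode(read[i:i + k]))
        self.keys = sorted(codes)

    def contains(self, code):
        """Membership test by binary search over the sorted keys."""
        i = bisect_left(self.keys, code)
        return i < len(self.keys) and self.keys[i] == code

    def successors(self, kmer):
        """K-mers in the set whose first k-1 bases equal the last k-1 of kmer."""
        shifted = (encode(kmer) << 2) & self.mask  # drop first base, open a slot
        return [decode(shifted | b, self.k)
                for b in range(4) if self.contains(shifted | b)]


graph = SortedKmerGraph(["ACGTAC", "ACGG"], k=3)
print(graph.successors("ACG"))
```

Edges are never stored explicitly in this sketch: an edge exists whenever both of its endpoint k-mers are in the set, so neighbor queries reduce to at most four membership tests.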
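
The idea behind a learned index, as described in the Background and Methods sections, can be sketched in a few lines: a model predicts where a key sits in a sorted array, and a search corrects the prediction within a known error bound. The example below is a deliberately simplified assumption, not the PGM-Index itself: it fits one linear segment over the whole array and measures its maximum error, whereas the PGM-Index uses several linear segments with a guaranteed error bound. The class name `LinearLearnedIndex` and its methods are hypothetical.

```python
from bisect import bisect_left


class LinearLearnedIndex:
    """Single-segment learned index over a non-empty sorted list of distinct ints."""

    def __init__(self, sorted_keys):
        self.keys = list(sorted_keys)
        n = len(self.keys)
        first, last = self.keys[0], self.keys[-1]
        # Line through the first and last key: position ~ slope * (key - first).
        self.first = first
        self.slope = (n - 1) / (last - first) if last > first else 0.0
        # Largest distance between predicted and true position over all keys.
        self.epsilon = max(abs(self._predict(key) - pos)
                           for pos, key in enumerate(self.keys))

    def _predict(self, key):
        """Approximate position of key, clamped to the valid index range."""
        pos = int(self.slope * (key - self.first))
        return min(max(pos, 0), len(self.keys) - 1)

    def find(self, key):
        """Return the position of key, or -1 if it is not stored."""
        guess = self._predict(key)
        lo = max(guess - self.epsilon, 0)
        hi = min(guess + self.epsilon + 1, len(self.keys))
        # Binary search only inside the window [guess - eps, guess + eps].
        i = bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return -1


index = LinearLearnedIndex([2, 3, 5, 8, 13, 21, 34, 55])
print(index.find(13), index.find(4))
```

Because the search window has width at most 2 * epsilon + 1, lookup cost depends on how well the model fits the key distribution rather than on the total number of keys, which is the property the paper relies on for large k-mer sets.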