Abstract: De Bruijn graphs are essential for sequencing data analysis and must be efficiently constructed and stored for large-scale population studies. They also need to be dynamic to allow updates such as adding or removing edges and nodes. Existing dynamic implementations include DynamicBOSS and dynamicDBG. In 2018, a new family of data structures called learned indexes was introduced by Tim Kraska and Alex Beutel, with a particularly efficient implementation proposed by Paolo Ferragina and Giorgio Vinciguerra in 2020. This paper presents a new method for implementing De Bruijn graphs using learned indexes and compares its performance with current implementations. The new method shows improved time and memory efficiency for edge and node insertions, particularly with large datasets (over 110 million k-mers).
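
To make the "dynamic" requirement concrete, here is a minimal Python sketch of a node-centric de Bruijn graph that supports adding and removing k-mers. It is only an illustration: the class name `DynamicDeBruijnGraph` and its methods are hypothetical, and it stores k-mers in a plain hash set rather than in the learned index the paper proposes.

```python
ALPHABET = "ACGT"


class DynamicDeBruijnGraph:
    """Toy de Bruijn graph: nodes are k-mers, edges are implied by (k-1)-overlaps.

    Assumption: a plain Python set stands in for the underlying k-mer index.
    """

    def __init__(self, k):
        self.k = k
        self.kmers = set()

    def add_kmer(self, kmer):
        if len(kmer) != self.k:
            raise ValueError(f"expected a {self.k}-mer, got {kmer!r}")
        self.kmers.add(kmer)

    def add_sequence(self, seq):
        # Slide a window of length k over the sequence and add every k-mer.
        for i in range(len(seq) - self.k + 1):
            self.add_kmer(seq[i:i + self.k])

    def remove_kmer(self, kmer):
        # Removing a node also removes its implied edges.
        self.kmers.discard(kmer)

    def successors(self, kmer):
        # An edge u -> v exists when the last k-1 bases of u equal the first k-1 of v.
        suffix = kmer[1:]
        return [suffix + c for c in ALPHABET if suffix + c in self.kmers]

    def predecessors(self, kmer):
        prefix = kmer[:-1]
        return [c + prefix for c in ALPHABET if c + prefix in self.kmers]


graph = DynamicDeBruijnGraph(k=3)
graph.add_sequence("ACGTAC")      # adds ACG, CGT, GTA, TAC
print(graph.successors("ACG"))    # ['CGT']
graph.remove_kmer("CGT")
print(graph.successors("ACG"))    # []
```

Because edges are derived from membership queries, the cost of every update and traversal step comes down to how fast the k-mer set can be searched and modified, which is the part a learned index is meant to accelerate.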

What problem does this paper attempt to address?

This paper addresses the efficient construction and storage of de Bruijn graphs for large-scale genomic sequencing data. Specifically, it proposes a new method that uses learned indexes to implement dynamic de Bruijn graphs and compares its performance with existing dynamic implementations (such as DynamicBOSS and dynamicDBG). The main goal is to improve the time and memory efficiency of insertion and deletion operations on large datasets (more than 110 million k-mers).

### Background

- **Genome Assembly**: With the development of high-throughput sequencing technology, genome assembly has become a major computational challenge in molecular biology and is recognized as an NP-hard problem.
- **Traditional Methods**: Traditional long-read assembly algorithms use overlap graphs, but this approach runs into computational limitations and problems with short reads on large-scale datasets.
- **de Bruijn Graphs**: Many recent algorithms have turned to de Bruijn graphs, where each node represents a k-mer (a substring of length k) and edges represent exact overlaps of length k-1. Although de Bruijn graphs are more efficient to construct than overlap graphs, construction still requires a large amount of memory, causing the overlap phase to become a bottleneck.
- **Learned Data Structures**: Combining recent advances in data structures and machine learning has produced learned data structures, which exploit patterns in the data to improve space efficiency and time performance.

### Methods

- **PGM-Index**: The paper uses the PGM-Index, an efficient dynamic indexing method, as the basis for the learned index.
- **Improvements**:
  - Support indexing of single-element vectors, not just key-value pairs.
  - Look for memory-efficient deletion methods.
  - Implement procedures to remove duplicate and deleted elements.
  - Use the KMC library for online index construction to optimize memory usage.

### Evaluation

- **Datasets**: The tests use the E. coli K-12 substr. MG1655 dataset, from which four subsets are generated containing 20,000, 200,000, 2,000,000 and 14,000,000 reads respectively.
- **Performance Analysis**:
  - **Creation**: The set-based implementation is more memory-efficient than the key-value-pair-based method, with comparable time performance. DynamicBOSS has poor time performance on all datasets but the best memory usage.
  - **Insertion**: DynamicBOSS cannot be tested on the larger datasets because its execution time is too long. The PGM-Index-based implementation shows reasonable time performance.
  - **Deletion**: Deletion follows a performance trend similar to insertion. DynamicBOSS cannot complete the test on the two largest datasets.
  - **Search**: DynamicBOSS is not competitive in either time or memory. The PGM-Index-based implementations show comparable time performance, and the single-element method uses less memory.

### Conclusions and Future Work

- **Current Results**: The proposed data structure, based on the dynamic PGM-Index Set, outperforms existing methods in time and memory efficiency for modification operations and searches.
- **Future Development Directions**:
  - Extend the k-mer representation to allow k values up to 255.
  - Implement batch insertion and deletion of k-mers.
  - Perform element modification and deletion physically.
  - Explore the feasibility of using the dynamic PGM-Index to represent colored de Bruijn graphs.
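
The idea behind the Methods section can be sketched as follows: a set of k-mers is stored as sorted integers, a piecewise-linear model predicts each key's rank to within an error bound, and a short binary search corrects the prediction. This is a simplified sketch, not the paper's implementation or the PGM-Index library. The 2-bit base encoding, the greedy "shrinking cone" segmentation, the insert buffer, and the tombstone set are all assumptions chosen for brevity, and every name here (`encode`, `build_segments`, `LearnedKmerSet`, `eps`, `buffer_limit`) is hypothetical.

```python
from bisect import bisect_left, bisect_right

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer):
    """Pack a k-mer into an integer, 2 bits per base (assumed encoding)."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def build_segments(keys, eps):
    """Greedily fit line segments that predict each key's rank within +/- eps.

    keys must be sorted and distinct. Returns (first_key, first_rank, slope) tuples.
    """
    segments = []
    start, lo, hi = 0, 0.0, float("inf")
    for rank in range(1, len(keys)):
        dx, dy = keys[rank] - keys[start], rank - start
        new_lo, new_hi = max(lo, (dy - eps) / dx), min(hi, (dy + eps) / dx)
        if new_lo > new_hi:
            # No single slope fits this point too: close the segment, start a new one.
            segments.append((keys[start], start, (lo + hi) / 2))
            start, lo, hi = rank, 0.0, float("inf")
        else:
            lo, hi = new_lo, new_hi
    if keys:
        slope = 0.0 if hi == float("inf") else (lo + hi) / 2
        segments.append((keys[start], start, slope))
    return segments


class LearnedKmerSet:
    def __init__(self, kmers, eps=4, buffer_limit=1024):
        self.eps, self.buffer_limit = eps, buffer_limit
        self.buffer, self.deleted = set(), set()  # pending inserts, tombstones
        self._rebuild(sorted({encode(km) for km in kmers}))

    def _rebuild(self, keys):
        self.keys = keys
        self.segments = build_segments(keys, self.eps)
        self.starts = [seg[0] for seg in self.segments]

    def _in_index(self, key):
        i = bisect_right(self.starts, key) - 1  # segment whose range covers key
        if i < 0:
            return False
        first_key, first_rank, slope = self.segments[i]
        guess = first_rank + round(slope * (key - first_key))
        # Search only a window of about 2 * eps positions around the prediction.
        lo = max(0, guess - self.eps - 1)
        hi = min(len(self.keys), guess + self.eps + 2)
        j = bisect_left(self.keys, key, lo, hi)
        return j < hi and self.keys[j] == key

    def contains(self, kmer):
        key = encode(kmer)
        if key in self.deleted:
            return False
        return key in self.buffer or self._in_index(key)

    def insert(self, kmer):
        key = encode(kmer)
        self.deleted.discard(key)
        if not self._in_index(key):
            self.buffer.add(key)
        if len(self.buffer) >= self.buffer_limit:
            self.compact()

    def delete(self, kmer):
        # Logical deletion: indexed keys only receive a tombstone.
        key = encode(kmer)
        self.buffer.discard(key)
        if self._in_index(key):
            self.deleted.add(key)

    def compact(self):
        # Merge buffered inserts, drop tombstoned keys, and refit the model.
        merged = (set(self.keys) | self.buffer) - self.deleted
        self.buffer, self.deleted = set(), set()
        self._rebuild(sorted(merged))
```

As far as I understand it, the real PGM-Index computes an optimal segmentation, indexes the segments recursively, and handles updates by merging sorted runs; the linear scan over a buffer and the full rebuild in `compact` are much cruder stand-ins that only show where duplicates and deleted elements get cleaned up.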

Implementation Of Dynamic De Bruijn Graphs Via Learned Index

Lossless Indexing with Counting de Bruijn Graphs

Lock-free de Bruijn graph

Cdbgtricks: strategies to update a compacted de bruijn graph

An Index for Sequencing Reads Based on The Colored de Bruijn Graph

Where the patterns are: repetition-aware compression for colored de Bruijn graphs

Effective indexing for dynamic structural graph clustering

Scaling metagenome sequence assembly with probabilistic de Bruijn graphs

K2R: Tinted de Bruijn Graphs implementation for efficient read extraction from sequencing datasets

Applications of de Bruijn graphs in microbiome research

De Bruijn goes Neural: Causality-Aware Graph Neural Networks for Time Series Data on Dynamic Graphs

Study of De Bruijn Graph for DNA Sequence Assembly

deGSM: memory scalable construction of large scale de Bruijn Graph

DBL: Efficient Reachability Queries on Dynamic Graphs (Complete Version)

Deep learning for dynamic graphs: models and benchmarks

Memory Efficient De Bruijn Graph Construction

BdBG: a Bucket-Based Method for Compressing Genome Sequencing Data with Dynamic De Bruijn Graphs.

Dynamic Subgraph Matching via Cost-Model-based Vertex Dominance Embeddings (Technical Report)

Prokrustean Graph: A substring index for rapid k-mer size analysis

Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2

Simulating the DNA String Graph in Succinct Space