The De Bruijn Graph Sequence Mapping Problem with Changes in the Graph

Lucas B. Rocha, Said Sadique Adi, Eloi Araujo
DOI: https://doi.org/10.1101/2024.02.15.580401
2024-06-01
Abstract: In computational biology, mapping a sequence onto a sequence graph G poses a significant challenge. One possible approach to tackling this problem is to find a walk p in G that spells a sequence most similar to the input sequence. This challenge is formally known as the Graph Sequence Mapping Problem (GSMP). In this paper, we delve into an alternative problem formulation known as the De Bruijn Graph Sequence Mapping Problem (BSMP). Both problems have three variants: changes only in the sequence, changes in the graph, and changes in both the sequence and the graph. We concentrate on addressing the variant involving changes in the graph. In the literature, when this problem does not allow the De Bruijn graph to induce new arcs after changes, it becomes NP-complete, as proven by Gibney et al. However, we reformulate the problem by considering the characteristics of the arcs induced in the De Bruijn graph. This reformulation alters the problem definition, thereby enabling the application of a polynomial-time algorithm for its resolution. Approaching the problem with this arc-inducing characteristic is pioneering, and the algorithm proposed in this work is groundbreaking in the literature.
Bioinformatics
What problem does this paper attempt to address?
This paper studies the variant of the De Bruijn Graph Sequence Mapping Problem (BSMP) in which changes are made only to the graph. In bioinformatics, mapping a sequence onto a sequence graph is an important task. Traditionally, a given sequence is compared against a single reference sequence, but a single reference may not cover all potential variations. Sequence graphs such as De Bruijn graphs address this limitation by representing multiple sequences at once. In a De Bruijn graph, each node is labeled with a sequence of length k, and there is an arc from one node to another if the last k-1 characters of the first label equal the first k-1 characters of the second.
The paper focuses on finding a walk in the De Bruijn graph that spells the sequence most similar to a given sequence, allowing for up to d differences. Previous research has shown that the problem is NP-complete when the De Bruijn graph is not allowed to induce new arcs after changes. This paper instead proposes a problem definition in which new arcs are induced when the graph changes. Under this definition, the authors develop a polynomial-time algorithm for this variant, which had not previously been addressed in the literature.
The paper first introduces the relevant concepts: sequences, distances, graphs, and matchings. It then describes in detail how De Bruijn graphs are edited, defines the new problem that allows changes in the graph, and shows how to relate this problem to the minimum-cost maximum matching problem in order to find an optimal solution. Finally, the paper proposes an algorithm called "De Bruijn Graph Mapping Tool and Graph Changes" (BMTC) to solve the problem and reports the running time and cost of the algorithm in experiments.
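The De Bruijn graph definition above can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not code from the paper: the function names (kmers, de_bruijn_graph, is_spelled_by_walk) are hypothetical, and the graph is built so that an arc is induced between every pair of nodes whose labels overlap by k-1 characters.

```python
def kmers(seq, k):
    """Return the k-mers of seq in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn_graph(sequences, k):
    """Build a De Bruijn graph as an adjacency dict {node: set of successors}.

    Nodes are the distinct k-mers of the input sequences. An arc u -> v is
    induced whenever the last k-1 characters of u equal the first k-1
    characters of v (hypothetical helper, not the paper's implementation).
    """
    nodes = set()
    for s in sequences:
        nodes.update(kmers(s, k))
    arcs = {u: set() for u in nodes}
    for u in nodes:
        for v in nodes:
            if u[1:] == v[:-1]:
                arcs[u].add(v)
    return arcs


def is_spelled_by_walk(seq, arcs, k):
    """Check whether the consecutive k-mers of seq form a walk in the graph."""
    ks = kmers(seq, k)
    if not ks or ks[0] not in arcs:
        return False
    return all(b in arcs[a] for a, b in zip(ks, ks[1:]))


graph = de_bruijn_graph(["ACGT", "CGTA"], 3)
print(is_spelled_by_walk("ACGTA", graph, 3))  # all three k-mers are nodes
print(is_spelled_by_walk("ACGTT", graph, 3))  # GTT is not a node
```

Because arcs are induced from the node labels alone, changing a node label can create or remove arcs, which is the characteristic the reformulated problem relies on.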
In conclusion, the paper addresses the sequence mapping problem with changes in the De Bruijn graph and proposes a novel algorithm that uses the Hungarian algorithm to find an optimal solution. The polynomial-time result applies to the reformulated problem, in which changes may induce new arcs; the formulation that forbids new arcs remains NP-complete.
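The summary does not spell out how the mapping problem is reduced to a matching problem, so the following Python sketch only illustrates the general idea of a minimum-cost assignment. It assumes, purely for illustration, that k-mers of the sequence that are missing from the graph are each assigned to a distinct graph node and that the cost of an assignment is the Hamming distance between the two labels. For brevity it enumerates all assignments by brute force instead of running the Hungarian algorithm, which solves the same problem in polynomial time. The names hamming and min_cost_assignment are hypothetical.

```python
from itertools import permutations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def min_cost_assignment(missing, available):
    """Assign each missing k-mer to a distinct available k-mer at minimum cost.

    Brute-force stand-in for the Hungarian algorithm (exponential time, only
    suitable for tiny inputs). The Hamming-distance cost is an assumption
    made for this illustration.
    """
    if len(missing) > len(available):
        raise ValueError("not enough available k-mers to assign")
    best_cost, best_map = None, None
    for perm in permutations(available, len(missing)):
        cost = sum(hamming(m, a) for m, a in zip(missing, perm))
        if best_cost is None or cost < best_cost:
            best_cost, best_map = cost, dict(zip(missing, perm))
    return best_cost, best_map


cost, mapping = min_cost_assignment(["ACT", "GGA"], ["ACG", "GTA", "CGT"])
print(cost, mapping)
```

In this toy instance the cheapest assignment relabels ACG as ACT and GTA as GGA, one substitution each, for a total cost of 2.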