Flowtigs: safety in flow decompositions for assembly graphs

Francisco Sena, Eliel Ingervo, Shahbaz Khan, Andrey Prjibelski, Sebastian Schmidt, Alexandru I. Tomescu
DOI: https://doi.org/10.1101/2023.11.17.567499
2024-02-14
Abstract: A decomposition of a network flow is a set of weighted paths whose superposition equals the flow. The problem of characterising and computing safe walks for flow decompositions has so far seen only a partial solution by restricting the flow decomposition to consist of paths, and the graph to be directed and acyclic (DAG). However, the problem of decomposing into closed walks in a general graph (allowing cycles) is still open. In this paper, we give a simple and linear-time-verifiable complete characterisation (flowtigs) of walks that are safe in such general flow decompositions, i.e. that are subwalks of any possible flow decomposition. Our characterisation generalises over the previous one for DAGs, using a more involved proof of correctness that works around various issues introduced by cycles. We additionally provide a time-optimal algorithm that identifies all maximal flowtigs and represents them inside a compact structure. We also implement this algorithm and show that it is very fast in practice. On the practical side, we study flowtigs in the use-case of metagenomic assembly. By using the species abundances as flow values of the metagenomic assembly graph, we can model the possible assembly solutions as flow decompositions into weighted closed walks. Compared to reporting unitigs or maximal safe walks based only on the graph structure (structural contigs), reporting flowtigs results in a notably more contiguous assembly. Specifically, on shorter contigs (75-percentile), we get an improvement in assembly contiguity of up to 99% over unitigs, and on the 50-percentile of contiguity we get an improvement of up to 17% over unitigs. These improvements that flowtigs bring over unitigs are 4–14× larger than what structural contigs bring over unitigs.
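The definition of a flow decomposition into weighted closed walks can be illustrated with a small sketch. This is not the paper's algorithm; it only checks that a candidate set of weighted closed walks superposes to a given flow. The function names `superpose` and `is_decomposition`, the walk representation (a list of nodes that implicitly returns to its first node), and the toy graph are all assumptions made for illustration.

```python
from collections import Counter

def superpose(weighted_walks):
    """Add up the weights of closed walks edge by edge (hypothetical helper)."""
    flow = Counter()
    for walk, weight in weighted_walks:
        # A closed walk v0, v1, ..., vk returns to v0, so pair each node
        # with its successor, wrapping around from the last node to the first.
        for u, v in zip(walk, walk[1:] + walk[:1]):
            flow[(u, v)] += weight
    return dict(flow)

def is_decomposition(flow, weighted_walks):
    """True if the walks' superposition equals the flow on every edge."""
    nonzero_flow = {edge: value for edge, value in flow.items() if value != 0}
    return superpose(weighted_walks) == nonzero_flow

# Toy flow on a graph with two cycles sharing the edge a -> b (assumed data).
flow = {
    ("a", "b"): 3,
    ("b", "c"): 2, ("c", "a"): 2,
    ("b", "d"): 1, ("d", "a"): 1,
}

# Two weighted closed walks: a -> b -> c -> a and a -> b -> d -> a.
walks = [(["a", "b", "c"], 2), (["a", "b", "d"], 1)]

print(is_decomposition(flow, walks))
```

In this toy instance the edge a -> b carries flow from both closed walks, so it must appear in any decomposition of the flow. That is the intuition behind a safe walk, although the paper's characterisation covers longer walks and general graphs.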
Bioinformatics