VACmap: an accurate long-read aligner for unraveling complex structural variations

Hongyu Ding, Zhirui Liao, Shanfeng Zhu
DOI: https://doi.org/10.1101/2023.08.03.551566
2023-01-01
Abstract: Recent advances in whole-genome sequencing have revealed that complex structural variations are far more diverse and abundant than previously thought. Unfortunately, these complex variations often remain understudied and underappreciated in genetic research because of inherent technical challenges. Existing mapping methods often fail to align long reads containing complex structural variations, because they align each long read by maximizing a colinear score function between the read and the reference sequence. In the presence of structural variations, however, a long read can only be represented against the reference in a non-linear manner, so colinear mapping is bound to produce incorrect and incomplete alignments. To address this critical issue, we have developed VACmap, an accurate mapping method specialized in mapping long reads that contain complex structural variations. VACmap incorporates a novel variant-aware chaining algorithm, which effectively identifies the globally optimal non-linear alignment for each long read. This approach faithfully represents both simple and complex structural variations and thus produces correct alignments. Our experiments confirm that VACmap significantly improves downstream detection of both simple and complex structural variations, allowing reliable and precise use of the signals in long reads. Notably, VACmap greatly improves the discovery rate of duplications in downstream detection, which existing methods often misalign as insertions. VACmap is also remarkably robust in identifying structural variations within repetitive genomic regions, which have traditionally posed challenges for mapping techniques. VACmap overcomes a longstanding hurdle in aligning long reads with structural variations, and we believe it will be widely adopted in various downstream analyses.
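The contrast between colinear and non-linear chaining can be illustrated with a toy dynamic program. The sketch below is not VACmap's algorithm, which the abstract does not specify. It is a minimal illustration under stated assumptions: anchors are exact same-strand matches given as (read_start, ref_start, length), the gap cost is an arbitrary function of diagonal drift, and the function name `best_chain` and the parameter `jump_penalty` are hypothetical. With `jump_penalty=None` only colinear transitions are allowed. With a numeric value, a transition that moves backwards on the reference is permitted at a fixed cost.

```python
from typing import List, Optional, Tuple

# Hypothetical anchor type: (read_start, ref_start, length), same strand only.
Anchor = Tuple[int, int, int]


def best_chain(anchors: List[Anchor],
               jump_penalty: Optional[int] = None) -> Tuple[int, List[Anchor]]:
    """Toy O(n^2) chaining DP (illustrative only, not VACmap's method).

    jump_penalty=None  -> strictly colinear chaining.
    jump_penalty=k     -> non-colinear transitions allowed at fixed cost k.
    """
    anchors = sorted(anchors)                 # order by read position
    n = len(anchors)
    score = [a[2] for a in anchors]           # a chain may start at any anchor
    prev = [-1] * n
    for i, (qi, ri, li) in enumerate(anchors):
        for j in range(i):
            qj, rj, lj = anchors[j]
            if qj + lj > qi:
                continue                      # anchors must not overlap on the read
            if rj + lj <= ri:
                # Colinear step: penalise drift away from the diagonal.
                cost = abs((qi - qj) - (ri - rj)) // 20
            elif jump_penalty is not None:
                cost = jump_penalty           # backward jump on the reference
            else:
                continue                      # forbidden under colinear chaining
            cand = score[j] + li - cost
            if cand > score[i]:
                score[i], prev[i] = cand, j
    end = max(range(n), key=score.__getitem__)
    total, path = score[end], []
    while end != -1:
        path.append(anchors[end])
        end = prev[end]
    return total, path[::-1]


# Reference segments A, B, C (100 bp each); the read carries A B B C,
# i.e. a tandem duplication of B.
dup_read = [(0, 0, 100), (100, 100, 100), (200, 100, 100), (300, 200, 100)]

colinear_score, colinear_path = best_chain(dup_read)
aware_score, aware_path = best_chain(dup_read, jump_penalty=10)
print(colinear_score, colinear_path)
print(aware_score, aware_path)
```

In this toy case the colinear chain has to skip one copy of B, leaving 100 bp of the read unaligned between flanking anchors, which is the pattern a downstream caller would read as an insertion. Allowing the penalised backward jump keeps both copies in the chain, so the duplication remains visible in the alignment.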