Abstract: Modern pangenome graphs are built using haplotype-resolved genome assemblies. During read mapping to a pangenome graph, prioritizing alignments that are consistent with the known haplotypes has been shown to improve genotyping accuracy. However, the existing rigorous formulations for sequence-to-graph co-linear chaining and alignment problems do not consider the haplotype paths in a pangenome graph. This often leads to spurious read alignments to paths that are unlikely recombinations of the known haplotypes. In this paper, we develop novel formulations and algorithms for haplotype-aware sequence alignment to an acyclic pangenome graph. We consider both sequence-to-graph chaining and sequence-to-graph alignment problems. Drawing inspiration from the commonly used models for genotype imputation, we assume that a query sequence is an imperfect mosaic of the reference haplotypes. Accordingly, we extend previous chaining and alignment formulations by introducing a recombination penalty for a haplotype switch. First, we solve haplotype-aware sequence-to-graph alignment in $O(|Q||E||\mathcal{H}|)$ time, where $Q$ is the query sequence, $E$ is the set of edges, and $\mathcal{H}$ is the set of haplotypes represented in the graph. To complement our solution, we prove that an algorithm significantly faster than $O(|Q||E||\mathcal{H}|)$ is impossible under the Strong Exponential Time Hypothesis (SETH). Second, we propose a haplotype-aware chaining algorithm that runs in $O(|\mathcal{H}|N \log |\mathcal{H}|N)$ time after graph preprocessing, where $N$ is the count of input anchors. We then establish that a chaining algorithm significantly faster than $O(|\mathcal{H}|N)$ is impossible under SETH. As a proof-of-concept of our algorithmic solutions, we implemented the chaining algorithm in the Minichain aligner. We demonstrate the advantage of the algorithm by aligning sequences sampled from the human major histocompatibility complex (MHC) to a pangenome graph of 60 MHC haplotypes. The proposed algorithm offers better consistency with ground-truth recombinations when compared to a haplotype-agnostic algorithm.
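To make the role of the recombination penalty concrete, the following is a minimal dynamic-programming sketch of haplotype-aware alignment on a small acyclic graph. It is an illustration under stated assumptions, not the formulation or implementation from the paper: vertices carry single characters and are numbered in topological order, each haplotype is a path given as a list of vertex ids, the query is non-empty, the alignment may start and end at any vertex, and the function name and the cost parameters (`mismatch`, `gap`, `rho`) are hypothetical. Each table entry is indexed by query prefix, vertex, and haplotype, and each edge is inspected once per (prefix, haplotype) pair, so the work is proportional to $|Q||E||\mathcal{H}|$.

```python
from math import inf

def haplotype_aware_align(query, labels, edges, haplotypes,
                          mismatch=3, gap=3, rho=1):
    """Minimum cost of aligning `query` to a walk in a character-labeled DAG.

    Staying on haplotype h across an edge is free only if h uses that edge;
    any other move across an edge pays the recombination penalty `rho`.
    Assumes vertex ids are a topological order and the query is non-empty.
    """
    n, m, n_haps = len(labels), len(query), len(haplotypes)
    preds = [[] for _ in range(n)]
    for u, v in edges:
        assert u < v, "vertices must be numbered in topological order"
        preds[v].append(u)
    # Edges that each haplotype traverses consecutively.
    hap_edges = [set(zip(p, p[1:])) for p in haplotypes]
    hap_sets = [set(p) for p in haplotypes]
    on_vertex = [[h for h in range(n_haps) if v in hap_sets[h]]
                 for v in range(n)]

    prev = prev_pred = None
    for i in range(m + 1):
        # cur[v][h]: best cost for query[:i] with the walk ending at v on h.
        cur = [[inf] * n_haps for _ in range(n)]
        # cur_pred[v][h]: best cost for query[:i] just before entering v on h.
        cur_pred = [[inf] * n_haps for _ in range(n)]
        best = [inf] * n  # min over haplotypes of cur[v][h]
        for v in range(n):
            for h in on_vertex[v]:
                p = i * gap  # start the walk at v; query[:i] is unaligned
                for u in preds[v]:
                    if (u, v) in hap_edges[h]:
                        p = min(p, cur[u][h])      # continue on haplotype h
                    p = min(p, best[u] + rho)      # haplotype switch
                cur_pred[v][h] = p
                c = p + gap                        # skip vertex v
                if i > 0:
                    sub = 0 if query[i - 1] == labels[v] else mismatch
                    c = min(c,
                            prev_pred[v][h] + sub,  # match or mismatch at v
                            prev[v][h] + gap)       # extra query character
                cur[v][h] = c
                best[v] = min(best[v], c)
        prev, prev_pred = cur, cur_pred
    return min(min(row) for row in prev)

# Toy graph with two bubbles; haplotype 0 spells ACTAT, haplotype 1 spells AGTCT.
LABELS = "ACGTACT"
EDGES = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 6), (5, 6)]
HAPS = [[0, 1, 3, 4, 6], [0, 2, 3, 5, 6]]

print(haplotype_aware_align("ACTAT", LABELS, EDGES, HAPS))          # exact haplotype
print(haplotype_aware_align("ACTCT", LABELS, EDGES, HAPS, rho=1))   # cheap switch
print(haplotype_aware_align("ACTCT", LABELS, EDGES, HAPS, rho=10))  # costly switch
```

In the toy graph, the query ACTCT is a mosaic of the two haplotypes. With a small `rho` the cheapest alignment switches haplotypes once, while with a large `rho` it is cheaper to stay on one haplotype and accept a mismatch, which is the trade-off the recombination penalty is meant to control.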
