Abstract

Motivation: Aligning reads to a reference genome sequence is a key step in the analysis of human whole-genome sequencing data produced by next-generation sequencing (NGS) technologies. The quality of downstream analysis, such as the clinical interpretation of genetic variants or a genome-wide association study, depends on correctly identifying the position of each read during alignment. The volume of human whole-genome NGS data is growing steadily, and a number of human genome sequencing projects worldwide have produced large-scale databases of genetic variants found in sequenced human genomes. This information about known genetic variants can be used to improve alignment quality when analysing sequencing data from a new individual, for example by constructing a genomic graph. Existing methods for aligning reads to a linear reference genome are fast, whereas methods for aligning reads to a genomic graph are more accurate in variable regions of the genome. A read alignment method that incorporates known genetic variants into the index of the linear reference sequence can combine the advantages of both groups of methods.

Results: In this paper, we present minimap2_index_modifier, a tool that constructs a modified index of a reference genome using known single nucleotide variants and insertions/deletions (indels) specific to a given human population. Using the modified minimap2 index improves variant calling quality without changes to the bioinformatics pipeline and without significant additional computational overhead. Using the PrecisionFDA Truth Challenge V2 benchmark data (HG002 short reads aligned to the GRCh38 linear reference (GCA_000001405.15) with parameters k = 27 and w = 14), we show that modifying the index with genetic variants from the Human Pangenome Reference Consortium reduces the number of false negative genetic variants by more than 9500 and the number of false positives by more than 7000.
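
To make the idea of adding known variants to a linear-reference index concrete, the following is a minimal sketch in Python and not the implementation of minimap2_index_modifier. It assumes a toy index that maps lexicographically smallest k-mers in each window of w consecutive k-mers (minimizers) to reference positions. The real minimap2 index uses hashed, strand-canonical minimizers and a binary format, both of which are omitted here. The function names `minimizers`, `build_index`, and `add_snv` are hypothetical, and only single nucleotide variants are handled.

```python
from typing import Dict, List, Set, Tuple


def minimizers(seq: str, k: int, w: int) -> Set[Tuple[str, int]]:
    """Return (k-mer, offset) pairs chosen as the lexicographically smallest
    k-mer in every window of w consecutive k-mers (leftmost on ties)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked: Set[Tuple[str, int]] = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        j = min(range(w), key=lambda i: window[i])
        picked.add((window[j], start + j))
    return picked


def build_index(ref: str, k: int, w: int) -> Dict[str, List[int]]:
    """Toy index: minimizer k-mer -> sorted list of reference positions."""
    index: Dict[str, List[int]] = {}
    for kmer, pos in sorted(minimizers(ref, k, w), key=lambda item: item[1]):
        index.setdefault(kmer, []).append(pos)
    return index


def add_snv(index: Dict[str, List[int]], ref: str, pos: int, alt: str,
            k: int, w: int) -> None:
    """Add minimizers of the alternate allele of an SNV at 0-based `pos`,
    stored under reference coordinates so the read still maps to the
    linear reference."""
    flank = k + w - 2  # enough context for every window that spans the SNV
    lo = max(0, pos - flank)
    hi = min(len(ref), pos + flank + 1)
    alt_seq = ref[lo:pos] + alt + ref[pos + 1:hi]
    for kmer, off in minimizers(alt_seq, k, w):
        # Keep only k-mers that actually cover the variant base.
        if off <= pos - lo < off + k:
            positions = index.setdefault(kmer, [])
            if lo + off not in positions:
                positions.append(lo + off)


# Small demonstration with toy parameters (the abstract uses k = 27, w = 14).
ref = "GATTACAGATTACA"
index = build_index(ref, k=3, w=2)
before = set(index)
add_snv(index, ref, pos=6, alt="C", k=3, w=2)
new_kmers = sorted(set(index) - before)
print(new_kmers)
```

In this sketch a read carrying the alternate allele can seed on one of the newly added k-mers and still be placed at the corresponding reference coordinate, which is the intuition behind keeping the speed of linear alignment while recovering reads in variable regions. Indels would additionally require shifting coordinates around the variant, which is not shown.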
