Abstract:

Motivation: Aligning reads to a reference genome sequence is a key step in the analysis of human whole-genome sequencing data produced by next-generation sequencing (NGS) technologies. The quality of downstream analyses, such as the clinical interpretation of genetic variants or genome-wide association studies, depends on alignment placing each read at its correct position. The volume of human whole-genome NGS data is growing constantly, and a number of sequencing projects worldwide have produced large-scale databases of genetic variants found in sequenced human genomes. This information about known variants can be used to improve alignment quality when analysing sequencing data from a new individual, for example by constructing a genomic graph. Existing methods for aligning reads to a linear reference genome are fast, whereas methods for aligning reads to a genomic graph are more accurate in variable regions of the genome. A read alignment method that incorporates known genetic variants into the index of the linear reference sequence combines the advantages of both approaches.

Results: We present minimap2_index_modifier, a tool that builds a modified index of a reference genome using known single-nucleotide variants and insertions/deletions (indels) specific to a given human population. Using the modified minimap2 index improves variant calling quality without changes to the bioinformatics pipeline and without significant additional computational overhead. On the PrecisionFDA Truth Challenge V2 benchmark data (HG002 short reads aligned to the GRCh38 linear reference (GCA_000001405.15) with parameters k = 27 and w = 14), modifying the index with genetic variants from the Human Pangenome Reference Consortium reduced the number of false-negative variants by more than 9500 and the number of false positives by more than 7000.
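
To make the roles of k (k-mer length) and w (window size) concrete, the following is a minimal sketch of (w, k)-minimizer selection and of how a known SNV can change the set of seeds near it. It is an illustration under simplifying assumptions, not the implementation of minimap2 or of minimap2_index_modifier: k-mers are ordered lexicographically here, whereas minimap2 orders hashed canonical k-mers, and the function names `minimizers`, `apply_snv`, and `variant_seeds` as well as the toy sequence are hypothetical.

```python
def minimizers(seq, k, w):
    """Return the set of (k-mer, position) pairs chosen as (w, k)-minimizers.

    For every window of w consecutive k-mers, the smallest k-mer is kept
    (leftmost on ties). Ordering is plain lexicographic, a simplification.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        # min() returns the first minimal element, so ties resolve leftmost.
        best = min(range(start, start + w), key=lambda i: kmers[i])
        picked.add((kmers[best], best))
    return picked


def apply_snv(ref, pos, alt_base):
    """Return a copy of ref with the base at 0-based pos replaced by alt_base."""
    return ref[:pos] + alt_base + ref[pos + 1:]


def variant_seeds(ref, pos, alt_base, k, w):
    """Minimizers present with the SNV applied but absent from the plain reference."""
    return minimizers(apply_snv(ref, pos, alt_base), k, w) - minimizers(ref, k, w)


# Toy example (hypothetical data): a C>A SNV at position 5 of a 10-base reference.
toy_ref = "CCCCCCCCCC"
print(sorted(variant_seeds(toy_ref, 5, "A", k=3, w=2)))
# → [('ACC', 5), ('CAC', 4), ('CCA', 3)]
```

In this toy case the three extra seeds all lie next to the variant. A read carrying the alternate allele shares them, while an index built from the unmodified reference contains none of them, which is one plausible way known variants could help seeding in variable regions.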

Seqminer2: an efficient tool to query and retrieve genotypes for statistical genetics analyses from biobank scale sequence dataset

NCBIminer: Sequences Harvest from Genbank

iSeq: An integrated tool to fetch public sequencing data

2FAST2Q: a general-purpose sequence search and counting program for FASTQ files

Efficient storage and regression computation for population-scale genome sequencing studies

Second-generation PLINK: rising to the challenge of larger and richer datasets

FANSe2: a robust and cost-efficient alignment tool for quantitative next-generation sequencing applications

BRGenomics for analyzing high-resolution genomics data in R

Reference Sequence Browser: An R application with a User-Friendly GUI to rapidly query sequence databases

CleanBSequences: an efficient curator of biological sequences in R

Ultrafast functional profiling of RNA-seq data for nonmodel organisms

diverse-seq: an application for alignment-free selecting and clustering biological sequences

SeqKit2: A Swiss army knife for sequence and alignment processing

RabbitQC: high-speed scalable quality control for sequencing data

GeneMiner: A tool for extracting phylogenetic markers from next‐generation sequencing data

BSAseq: an interactive and integrated web-based workflow for identification of causal mutations in bulked F2 populations

A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data

Enhancing SNV identification in whole-genome sequencing data through the incorporation of known genetic variants into the minimap2 index

Cross-Species Application of Illumina iScan Microarrays for Cost-Effective, High-Throughput SNP Discovery

DNAscan: a fast, computationally and memory efficient bioinformatics pipeline for the analysis of DNA next-generation-sequencing data

Efficient Seeding for Error-Prone Sequences with SubseqHash2