pathMap: a path-based mapping tool for long noisy reads with high sensitivity

Ze-Gang Wei, Xiao-Dan Zhang, Xing-Guo Fan, Yu Qian, Fei Liu, Fang-Xiang Wu
DOI: https://doi.org/10.1093/bib/bbae107
IF: 9.5
2024-03-23
Briefings in Bioinformatics
Abstract: With the rapid development of single-molecule sequencing (SMS) technologies, the output read length is continuously increasing. Mapping such reads onto a reference genome is one of the most fundamental tasks in sequence analysis. Mapping sensitivity is becoming a major concern since high sensitivity can detect more aligned regions on the reference and obtain more aligned bases, which are useful for downstream analysis. In this study, we present pathMap, a novel k-mer graph-based mapper that is specifically designed for mapping SMS reads with high sensitivity. By viewing the alignment chain as a path containing as many anchors as possible in the matched k-mer graph, pathMap treats chaining as a path selection problem in the directed graph. pathMap iteratively searches the longest path in the remaining nodes; more candidate chains with high quality can be effectively detected and aligned. Compared to other state-of-the-art mapping methods such as minimap2 and Winnowmap2, experiment results on simulated and real-life datasets demonstrate that pathMap obtains the number of mapped chains at least 11.50% more than its closest competitor and increases the mapping sensitivity by 17.28% and 13.84% of bases over the next-best mapper for Pacific Biosciences and Oxford Nanopore sequencing data, respectively. In addition, pathMap is more robust to sequence errors and more sensitive to species- and strain-specific identification of pathogens using MinION reads.
biochemical research methods, mathematical & computational biology
What problem does this paper attempt to address?
This paper addresses the problem of mapping long, high-error-rate reads produced by single-molecule sequencing (SMS) technologies onto a reference genome with high sensitivity. As SMS technologies develop, read lengths keep increasing, but the reads also carry relatively high error rates, which makes sequence alignment difficult. Existing mapping tools often lose sensitivity on such long, noisy reads: they miss alignable regions on the reference and recover fewer aligned bases, which hurts downstream analysis. To meet this challenge, the study proposes pathMap, a path-based long-read mapping tool. pathMap treats an alignment chain as a path through the matched k-mer graph that contains as many anchors as possible, which turns chaining into a path selection problem in a directed graph. It iteratively searches for the longest path among the remaining nodes, so that more high-quality candidate chains are detected and aligned, which improves mapping sensitivity. On simulated and real datasets, pathMap reports at least 11.50% more mapped chains than its closest competitor among state-of-the-art mappers such as minimap2 and Winnowmap2, and it raises base-level mapping sensitivity over the next-best mapper by 17.28% for Pacific Biosciences data and 13.84% for Oxford Nanopore data, with the advantage most pronounced on high-error-rate datasets. pathMap is also more robust to sequencing errors and more sensitive in species- and strain-specific identification of pathogens from MinION reads.
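The idea of chaining as iterative longest-path selection can be illustrated with a small sketch. This is not pathMap's implementation: the anchor representation, the edge rule, the path score (plain anchor count), and the parameters `max_gap` and `min_anchors` are all assumptions made for illustration, and only forward-strand matches are considered.

```python
from typing import List, Set, Tuple

# An anchor is a matched k-mer: (position in read, position in reference).
Anchor = Tuple[int, int]


def longest_path(anchors: List[Anchor], max_gap: int) -> List[Anchor]:
    """Return the path with the most anchors in the DAG of colinear anchors.

    Assumed edge rule: a -> b if b lies strictly after a on both the read
    and the reference, and both gaps are at most max_gap.
    """
    nodes = sorted(anchors)  # sorting by read position gives a topological order
    if not nodes:
        return []
    best = [1] * len(nodes)   # best[j]: anchors on the longest path ending at j
    prev = [-1] * len(nodes)  # back-pointers for path recovery
    for j, (rj, gj) in enumerate(nodes):
        for i in range(j):
            ri, gi = nodes[i]
            colinear = ri < rj and gi < gj
            close = rj - ri <= max_gap and gj - gi <= max_gap
            if colinear and close and best[i] + 1 > best[j]:
                best[j] = best[i] + 1
                prev[j] = i
    end = max(range(len(nodes)), key=best.__getitem__)
    path = []
    while end != -1:
        path.append(nodes[end])
        end = prev[end]
    return path[::-1]


def iterative_chains(anchors: List[Anchor], max_gap: int = 500,
                     min_anchors: int = 3) -> List[List[Anchor]]:
    """Repeatedly extract the longest path from the remaining anchors."""
    remaining: Set[Anchor] = set(anchors)
    chains = []
    while remaining:
        path = longest_path(list(remaining), max_gap)
        if len(path) < min_anchors:  # hypothetical quality cut-off
            break
        chains.append(path)
        remaining -= set(path)       # used anchors leave the graph
    return chains


if __name__ == "__main__":
    # Two interleaved candidate regions plus one isolated anchor.
    demo = [(0, 100), (10, 110), (20, 120), (30, 130),
            (5, 900), (15, 910), (25, 920), (50, 5000)]
    for chain in iterative_chains(demo):
        print(chain)
```

Because used anchors are removed after each round, a second, shorter chain that a single best-chain pass would discard is still recovered, which is the intuition behind detecting more candidate chains.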