Protein-to-genome alignment with miniprot

Heng Li
DOI: https://doi.org/10.48550/arXiv.2210.08052
2022-12-29
Abstract: Motivation: Protein-to-genome alignment is critical to annotating genes in non-model organisms. While there are a few tools for this purpose, all of them were developed over ten years ago and did not incorporate the latest advances in alignment algorithms. They are inefficient and could not keep up with the rapid production of new genomes and quickly growing protein databases. Results: Here we describe miniprot, a new aligner for mapping protein sequences to a complete genome. Miniprot integrates recent techniques such as k-mer sketch and SIMD-based dynamic programming. It is tens of times faster than existing tools while achieving comparable accuracy on real data. Availability and implementation: https://github.com/lh3/miniprot
Genomics
What problem does this paper attempt to address?
This paper addresses **the efficiency and accuracy of protein-to-genome alignment for gene annotation in non-model organisms**. Most existing protein-to-genome aligners were developed more than ten years ago and do not incorporate recent advances in alignment algorithms. As a result, they are too slow to keep up with the rapid production of new genomes and the growth of protein databases. The author therefore developed a new aligner, **miniprot**, which integrates recent techniques such as k-mer sketch and SIMD-based dynamic programming. It is tens of times faster than existing tools while achieving comparable accuracy on real data.

### Main challenges

1. **Complexity of dynamic programming**: The core of protein-to-genome alignment is a complex dynamic-programming algorithm that must handle affine gap penalties, introns, and frameshifts simultaneously.
2. **Splicing signal modeling**: A successful aligner needs to model splicing signals correctly, much as a gene predictor does.
3. **Efficient implementation**: These algorithms must be combined with modern computing techniques to run efficiently.

### Solutions

miniprot addresses these challenges in the following ways:

- **k-mer indexing and seed-and-extend strategy**: Uses k-mer indexing and a seed-and-extend strategy to accelerate alignment.
- **Dynamic-programming optimization**: Uses SIMD instructions to speed up the dynamic-programming algorithm significantly.
- **Improved splicing model**: Introduces a more flexible splicing model that better handles both conservation and differences among species.
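
To make the k-mer seeding idea concrete, here is a minimal Python sketch. It is a toy illustration and not miniprot's implementation: it assumes exact amino-acid k-mer matches, only the three forward reading frames, and a plain dictionary index. The function names `translate`, `index_kmers`, and `find_seeds` are hypothetical, and the sketch omits chaining, extension, splicing, and SIMD entirely.

```python
from collections import defaultdict

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame):
    """Translate one forward reading frame; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(frame, len(dna) - 2, 3)
    )


def index_kmers(protein, k):
    """Map each amino-acid k-mer of the protein to its start positions."""
    index = defaultdict(list)
    for p in range(len(protein) - k + 1):
        index[protein[p:p + k]].append(p)
    return index


def find_seeds(genome, protein, k=3):
    """Return (frame, genome_pos, protein_pos) for every exact k-mer hit.

    Assumption for illustration: forward strand only, exact matches only.
    """
    index = index_kmers(protein, k)
    seeds = []
    for frame in range(3):
        aa = translate(genome, frame)
        for i in range(len(aa) - k + 1):
            for p in index.get(aa[i:i + k], ()):
                # Amino-acid offset i in this frame starts at base frame + 3*i.
                seeds.append((frame, frame + 3 * i, p))
    return seeds
```

For example, `find_seeds("CCATGGCCAAGTGGT", "MAKW")` returns `[(2, 2, 0), (2, 5, 1)]`: both protein 3-mers are found in reading frame 2. A real aligner would then chain such seeds and fill the gaps between them with dynamic programming.
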
### Results

The experiments show that miniprot performs well on multiple evaluation metrics. Its performance is more stable than that of existing tools when aligning proteins from distantly related species, and it runs tens of times faster.

### Application prospects

miniprot is expected to simplify existing gene-annotation pipelines. When proteins from a closely related species are available, it can find 90% of the coding regions within a few minutes. It can also be used to evaluate the quality of de novo assemblies and to support gene annotation in large-scale genome projects.

In conclusion, miniprot is a fast and accurate protein-to-genome aligner that significantly improves alignment efficiency while maintaining high accuracy.