copMEM: Finding maximal exact matches via sampling both genomes

Szymon Grabowski, Wojciech Bieniecki
DOI: https://doi.org/10.48550/arXiv.1805.08816
2018-05-23
Abstract: Genome-to-genome comparisons require designating anchor points, which are given by Maximal Exact Matches (MEMs) between their sequences. For large genomes this is a challenging problem and the performance of existing solutions, even in parallel regimes, is not quite satisfactory. We present a new algorithm, copMEM, that sparsely samples both input genomes, with sampling steps being coprime. Despite being a single-threaded implementation, copMEM computes all MEMs of minimum length 100 between the human and mouse genomes in less than 2 minutes, using less than 10 GB of RAM.
Data Structures and Algorithms, Genomics
What problem does this paper attempt to address?
This paper addresses how to efficiently find Maximal Exact Matches (MEMs) when comparing genomes. For large genomes, the performance of existing solutions is unsatisfactory even in a parallel computing environment. Specifically:

1. **Problem Background**:
   - Genome-to-genome comparison requires anchor points, which are given by the MEMs between the two sequences.
   - For high-throughput sequencing data, finding MEMs has two basic applications:
     1. Providing seeds for the alignment of sequencing reads in genome assembly.
     2. Designating anchor points for genome-to-genome comparison.
2. **Limitations of Existing Methods**:
   - Early algorithms were based on suffix trees or enhanced suffix arrays, data structures that occupy a large amount of memory.
   - Later methods such as essaMEM and E-MEM are more compact and faster, but still hit performance bottlenecks on very large genomes.
   - Fixed sampling and minimizer sampling each have their own advantages and disadvantages; fixed sampling is generally considered the better choice because it occupies less space.
3. **The New Method Proposed in the Paper**:
   - The paper proposes a new algorithm, copMEM, which sparsely samples both input genomes using coprime sampling steps.
   - Specifically, copMEM selects two positive integer parameters \( k_1 \) and \( k_2 \) such that \( \gcd(k_1, k_2) = 1 \) and \( k_1 \times k_2 \leq L - K + 1 \), where \( L \) is the minimum MEM length and \( K \) is the length of the sampled seeds (k-mers). One genome is sampled every \( k_1 \) positions and the other every \( k_2 \) positions.
   - This allows both genomes to be sampled with a step greater than 1 while still guaranteeing that every match of length at least \( L \) contains a seed sampled in both genomes.
4. **Performance Advantages**:
   - Although it is a single-threaded implementation, copMEM finds all MEMs of length at least 100 between the human and mouse genomes in less than 2 minutes, using less than 10 GB of memory.
   - The experimental results show that copMEM is an order of magnitude faster than competing algorithms (such as essaMEM and E-MEM) and is also competitive in terms of memory usage.

In summary, this paper aims to significantly improve the speed of finding MEMs between large genomes by introducing copMEM, a new algorithm based on sampling both genomes with coprime steps.
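The coprime sampling condition described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `find_mems` and `brute_force_mems`, the dictionary-based seed index, and the toy sequences are all assumptions made for illustration. The sketch indexes every \( k_1 \)-th seed of length \( K \) in one sequence, probes every \( k_2 \)-th seed of the other, and extends each hit in both directions. Because \( \gcd(k_1, k_2) = 1 \), the Chinese remainder theorem gives, for any match of length at least \( L \), an offset \( t < k_1 \times k_2 \) at which both sequences are sampled, and \( k_1 \times k_2 \leq L - K + 1 \) ensures that the seed starting at that offset lies entirely inside the match.

```python
from math import gcd


def find_mems(ref, qry, L, K, k1, k2):
    """Return the set of (ref_start, qry_start, length) MEMs with length >= L.

    Hypothetical sketch: ref is sampled every k1 positions, qry every k2.
    """
    if gcd(k1, k2) != 1 or k1 * k2 > L - K + 1:
        raise ValueError("need gcd(k1, k2) == 1 and k1 * k2 <= L - K + 1")

    # Index every k1-th seed (substring of length K) of the reference.
    index = {}
    for i in range(0, len(ref) - K + 1, k1):
        index.setdefault(ref[i:i + K], []).append(i)

    mems = set()
    # Probe every k2-th seed of the query against the index.
    for j in range(0, len(qry) - K + 1, k2):
        for i in index.get(qry[j:j + K], ()):
            # Extend the seed hit to the left as far as the characters agree.
            a, b = i, j
            while a > 0 and b > 0 and ref[a - 1] == qry[b - 1]:
                a -= 1
                b -= 1
            # Extend to the right in the same way.
            e, f = i + K, j + K
            while e < len(ref) and f < len(qry) and ref[e] == qry[f]:
                e += 1
                f += 1
            # Keep only maximal matches that are long enough; the set
            # removes duplicates found through different seeds.
            if e - a >= L:
                mems.add((a, b, e - a))
    return mems


def brute_force_mems(ref, qry, L):
    """Quadratic reference implementation, used only to check the sketch."""
    mems = set()
    for i in range(len(ref)):
        for j in range(len(qry)):
            # Skip start pairs that could be extended to the left.
            if i > 0 and j > 0 and ref[i - 1] == qry[j - 1]:
                continue
            n = 0
            while i + n < len(ref) and j + n < len(qry) and ref[i + n] == qry[j + n]:
                n += 1
            if n >= L:
                mems.add((i, j, n))
    return mems


# Toy example with a planted shared region (made-up sequences).
shared = "GATTACAGATTACACCGGTTAACC"
ref = "ACGTACGTAC" + shared + "TTTTTT"
qry = "CCCCC" + shared + "GGGGGGG"
print(sorted(find_mems(ref, qry, L=12, K=4, k1=3, k2=2)))
```

Under these assumptions, the sampled search returns the same set as the brute-force search for any valid parameter choice, while storing only about one in \( k_1 \) reference seeds and probing only about one in \( k_2 \) query seeds.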