Abstract: Genome-to-genome comparisons require designating anchor points, which are given by Maximal Exact Matches (MEMs) between their sequences. For large genomes this is a challenging problem and the performance of existing solutions, even in parallel regimes, is not quite satisfactory. We present a new algorithm, copMEM, that sparsely samples both input genomes, with the sampling steps being coprime. Despite being a single-threaded implementation, copMEM computes all MEMs of minimum length 100 between the human and mouse genomes in less than 2 minutes, using less than 10 GB of RAM.

What problem does this paper attempt to address?

The paper addresses how to find Maximal Exact Matches (MEMs) efficiently when comparing genomes. For large genomes, existing solutions perform unsatisfactorily even in a parallel computing environment. Specifically:

1. **Problem Background**:
   - Genome-to-genome comparison requires anchor points, and these anchor points are given by the MEMs between the two sequences.
   - For high-throughput sequencing data, finding MEMs has two basic applications:
     1. Providing seeds for the alignment of sequencing reads in genome assembly.
     2. Specifying anchor points for the comparison between genomes.
2. **Limitations of Existing Methods**:
   - Early algorithms were based on suffix trees or enhanced suffix arrays, and these data structures occupy a large amount of memory.
   - Later methods such as essaMEM and E-MEM are more compact and faster, but still hit performance bottlenecks on very large genomes.
   - Fixed sampling and minimizer sampling each have their own advantages and disadvantages; fixed sampling is generally considered the better choice because it occupies less space.
3. **The New Method Proposed in the Paper**:
   - The paper proposes a new algorithm, copMEM, which sparsely samples both input genomes using coprime sampling steps.
   - Specifically, copMEM selects two positive integer parameters \( k_1 \) and \( k_2 \) such that \( \gcd(k_1, k_2) = 1 \) and \( k_1 \times k_2 \leq L - K + 1 \), where \( L \) is the minimum MEM length and \( K \) is the length of the sampled seeds (k-mers).
   - This allows both genomes to be sampled with a step size greater than 1 while still finding every MEM of length at least \( L \) (100 in the reported experiments), even in a single-threaded implementation.
4. **Performance Advantages**:
   - Although it is a single-threaded implementation, copMEM computes all MEMs of minimum length 100 between the human and mouse genomes in less than 2 minutes, using less than 10 GB of memory.
   - The experimental results show that copMEM is an order of magnitude faster than competing algorithms (such as essaMEM and E-MEM) and is also competitive in terms of memory usage.

In summary, this paper aims to significantly speed up the search for MEMs between large genomes by introducing copMEM, a new algorithm based on sampling both genomes with coprime steps.
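To make the coprime sampling condition concrete, here is a minimal Python sketch of the idea. It is not the authors' implementation: the function names `shared_sample_offset` and `find_mems` are hypothetical, the index is a plain dictionary of K-mers rather than whatever hashing scheme copMEM actually uses, and the parameters are toy-sized. It assumes, as stated above, that \( L \) is the minimum MEM length and \( K \) the seed length. The point it illustrates is that when \( \gcd(k_1, k_2) = 1 \) and \( k_1 \times k_2 \leq L - K + 1 \), every match of length at least \( L \) contains at least one K-mer that starts at a multiple of \( k_1 \) in the first genome and at a multiple of \( k_2 \) in the second, so sampling both sequences sparsely should not miss any such match.

```python
from collections import defaultdict
from math import gcd


def shared_sample_offset(i, j, k1, k2, L, K):
    """Smallest t in [0, L-K] such that i+t is a multiple of k1 and
    j+t is a multiple of k2, or None if no such t exists.

    For coprime k1, k2 with k1*k2 <= L-K+1, the Chinese remainder
    theorem implies that such a t always exists.
    """
    for t in range(L - K + 1):
        if (i + t) % k1 == 0 and (j + t) % k2 == 0:
            return t
    return None


def find_mems(ref, qry, L, K, k1, k2):
    """Toy MEM finder: sample ref every k1 positions, qry every k2.

    Returns sorted (ref_start, qry_start, length) triples for all
    maximal exact matches of length >= L.
    """
    if gcd(k1, k2) != 1 or k1 * k2 > L - K + 1:
        raise ValueError("need gcd(k1, k2) == 1 and k1*k2 <= L-K+1")

    # Index the K-mers of ref that start at multiples of k1.
    index = defaultdict(list)
    for p in range(0, len(ref) - K + 1, k1):
        index[ref[p:p + K]].append(p)

    mems = set()
    # Probe with the K-mers of qry that start at multiples of k2.
    for q in range(0, len(qry) - K + 1, k2):
        for p in index.get(qry[q:q + K], ()):
            # Extend the seed to the left as far as the characters agree.
            a, b = p, q
            while a > 0 and b > 0 and ref[a - 1] == qry[b - 1]:
                a -= 1
                b -= 1
            # Extend the seed to the right in the same way.
            e, f = p + K, q + K
            while e < len(ref) and f < len(qry) and ref[e] == qry[f]:
                e += 1
                f += 1
            # Keep only matches that reach the minimum length;
            # the set removes matches reached from several seeds.
            if e - a >= L:
                mems.add((a, b, e - a))
    return sorted(mems)
```

As a small usage example, `find_mems("TTACGTACGTAA", "GGACGTACGTCC", L=8, K=3, k1=2, k2=3)` should report the single match `(2, 2, 8)`: the seed `ACG` at position 6 is sampled in both strings and extends to the full 8-character match. With these toy parameters only every second K-mer of the first string and every third K-mer of the second are examined, which is the source of the speed-up the summary describes.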

copMEM: Finding maximal exact matches via sampling both genomes

Fast Detection of Maximal Exact Matches Via Fixed Sampling of Query k-Mers and Bloom Filtering of Index k-Mers

How to Find Long Maximal Exact Matches and Ignore Short Ones

MEM-based pangenome indexing for k-mer queries

Gene Sequence Alignment on a Public Computing Platform

MONI: A Pangenomic Index for Finding Maximal Exact Matches

Short Read Alignment Based On Maximal Approximate Match Seeds

Accelerating spliced alignment of long RNA sequencing reads using parallel maximal exact match retrieval

Computing Maximal Unique Matches with the r-index

Allowing mutations in maximal matches boosts genome compression performance

Indexing large genome collections on a PC

An efficient parallel algorithm for multiple sequence similarities calculation using a low complexity method

MaxSSmap: A GPU program for mapping divergent short reads to genomes with the maximum scoring subsequence

PARMIK: PArtial Read Matching with Inexpensive K-mers

CRAFT: Compact genome Representation toward large-scale Alignment-Free daTabase

Perm: Efficient Mapping of Short Sequencing Reads with Periodic Full Sensitive Spaced Seeds

GADEM: A Genetic Algorithm Guided Formation of Spaced Dyads Coupled with an EM Algorithm for Motif Discovery

EGM: encapsulated gene-by-gene matching to identify gene orthologs and homologous segments in genomes

Minimap2: pairwise alignment for nucleotide sequences

CASA: An Energy-Efficient and High-Speed CAM-based SMEM Seeding Accelerator for Genome Alignment

Analyzing large-scale DNA Sequences on Multi-core Architectures