An average-case efficient two-stage algorithm for enumerating all longest common substrings of minimum length k between genome pairs

Mattia Prosperi, Simone Marini, Christina Boucher
DOI: https://doi.org/10.1109/ichi61247.2024.00020
Abstract: An extension of the longest common substring (LCS) problem between two texts is the enumeration of all LCSs of a given minimum length k (ALCS-k), along with their positions in each text. In bioinformatics, an efficient solution to the ALCS-k problem for very long texts (genomes or metagenomes) can provide useful insights for discovering genetic signatures responsible for biological mechanisms. The ALCS-k problem has two additional requirements compared to the LCS problem: one is the minimum length k, and the other is that all common strings longer than k must be reported. We present an efficient, two-stage ALCS-k algorithm exploiting the spectrum of text substrings of length k (k-mers). Our approach yields a worst-case time complexity loglinear in the number of k-mers for the first stage, and an average-case time complexity loglinear in the number of common k-mers for the second stage (several orders of magnitude smaller than the total k-mer spectrum). The space complexity is linear in the first phase (disk-based), and on average linear in the second phase (disk- and memory-based). Tests performed on genomes of different organisms (including viruses, bacteria, and animal chromosomes) show that run times are consistent with our theoretical estimates; further, comparisons with MUMmer4 show an asymptotic advantage with divergent genomes.
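To make the problem statement concrete, here is a minimal Python sketch of one way to enumerate common substrings of length at least k with their positions, using shared k-mers as seeds. It is not the authors' algorithm: the function names `kmer_index` and `common_substrings_min_k` are hypothetical, the k-mer spectra are held in in-memory dictionaries rather than on disk, and the output is assumed to be the set of maximal common matches (those that cannot be extended left or right). Nested position loops make it quadratic on highly repetitive inputs, so it illustrates the task rather than the complexity bounds stated above.

```python
from collections import defaultdict


def kmer_index(text, k):
    """Map each k-mer of `text` to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(text) - k + 1):
        index[text[i:i + k]].append(i)
    return index


def common_substrings_min_k(a, b, k):
    """Return (substring, pos_in_a, pos_in_b) for each maximal common
    match of length >= k between texts `a` and `b`."""
    # Stage 1 (illustrative): build the k-mer spectrum of each text.
    index_a = kmer_index(a, k)
    index_b = kmer_index(b, k)

    results = []
    # Stage 2 (illustrative): only k-mers present in both texts are seeds.
    for kmer in index_a.keys() & index_b.keys():
        for i in index_a[kmer]:
            for j in index_b[kmer]:
                # Skip seeds that extend to the left: the same match is
                # reported once, from its leftmost seed.
                if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                    continue
                # Extend the seed to the right as far as the texts agree.
                length = k
                while (i + length < len(a) and j + length < len(b)
                       and a[i + length] == b[j + length]):
                    length += 1
                results.append((a[i:i + length], i, j))
    # Sort by position so the output order is deterministic.
    return sorted(results, key=lambda r: (r[1], r[2]))
```

For example, with `a = "ACGTACGGA"`, `b = "TTACGTAGG"`, and `k = 3`, the sketch reports `ACGTA` at positions 0 and 2, and `TACG` at positions 3 and 1.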