Abstract: Multiple longest common subsequence (MLCS) mining, a classical NP-hard problem, is an important task in many fields. Numerous applications in these fields generate very long sequences (i.e., sequences of length ≥ 10^4), called big sequences. Such big sequences present a serious challenge to existing MLCS algorithms. Although significant efforts have been made to tackle this challenge, existing exact and approximate MLCS algorithms alike fail on big sequences, because their problem-solving model, the MLCS-DAG (Directed Acyclic Graph), becomes too large to compute due to memory explosion. To bridge this gap, this paper first proposes a new strategy for identifying and deleting different classes of non-critical points, i.e., points on the MLCS-DAG that do not contribute to the solution. It then proposes a new MLCS problem-solving graph model, called KP-MLCS-DAG (Key Point based MLCS-DAG). A novel parallel MLCS algorithm, called KP-MLCS (Key Point based MLCS), is also presented, which can mine and compress all MLCSs of big sequences effectively and efficiently. Extensive experiments on both synthetic and real-world biological sequences show that the proposed KP-MLCS algorithm drastically outperforms existing state-of-the-art algorithms in both efficiency and effectiveness. The source code of KP-MLCS and the related test datasets are available online: https://github.com/kp-mlcs/KP-MLCS.
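
To make the graph model mentioned in the abstract concrete, the following is a minimal sketch of a generic match-point DAG search for the MLCS length. It is not the KP-MLCS algorithm and performs no non-critical point deletion; the function names `build_successor_tables` and `mlcs_length` are hypothetical and chosen only for illustration. Each node is a tuple of positions, one per sequence, and an edge follows the next common occurrence of a character in every sequence. The number of such nodes grows quickly with sequence length and count, which is the memory explosion the abstract refers to, so this sketch is suitable only for short inputs.

```python
from functools import lru_cache


def build_successor_tables(seqs, alphabet):
    """For each sequence, tables[i][c][p] is the smallest index q >= p
    with seqs[i][q] == c, or None if c does not occur at or after p."""
    tables = []
    for s in seqs:
        n = len(s)
        table = {c: [None] * (n + 1) for c in alphabet}
        # Fill from right to left so each entry reuses the one after it.
        for p in range(n - 1, -1, -1):
            for c in alphabet:
                table[c][p] = p if s[p] == c else table[c][p + 1]
        tables.append(table)
    return tables


def mlcs_length(seqs):
    """Length of a longest common subsequence of all sequences in seqs.

    Illustrative baseline only: it explores the DAG of match points with
    memoization and keeps every visited node in memory.
    """
    if not seqs:
        return 0
    # Only characters present in every sequence can appear in an MLCS.
    alphabet = sorted(set.intersection(*(set(s) for s in seqs)))
    if not alphabet:
        return 0
    tables = build_successor_tables(seqs, alphabet)

    @lru_cache(maxsize=None)
    def longest_from(point):
        # point[i] is the first index of sequence i that is still unused.
        best = 0
        for c in alphabet:
            successor = []
            for i, p in enumerate(point):
                q = tables[i][c][p]
                if q is None:
                    break  # c is exhausted in sequence i: no edge for c
                successor.append(q + 1)
            else:
                best = max(best, 1 + longest_from(tuple(successor)))
        return best

    return longest_from(tuple(0 for _ in seqs))


if __name__ == "__main__":
    print(mlcs_length(["ACGT", "AGCT", "ACT"]))
```

The memoization cache plays the role of the stored DAG here: every distinct position tuple reached is retained, so memory use scales with the number of reachable match points rather than with the input size alone.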

Efficient algorithms for the longest common subsequence in $k$-length substrings

Efficient Algorithms for Finding a Longest Common Increasing Subsequence

An average-case efficient two-stage algorithm for enumerating all longest common substrings of minimum length k between genome pairs

A note on the longest common substring with $k$-mismatches problem

A sub-quadratic algorithm for the longest common increasing subsequence problem

On the Complexity of Constrained Sequences Alignment Problems

A Fast Longest Common Subsequence Algorithm for Biosequences Alignment

A Parallel LCS Algorithm for Biosequences Alignment

A Branch Elimination-Based Efficient Algorithm for Large-Scale Multiple Longest Common Subsequence Problem

Constrained Pairwise and Center-Star Sequences Alignment Problems

Approximability of Constrained LCS

Hardness of Longest Common Subsequence for Sequences with Bounded Run-Lengths

Polynomial-time equivalences and refined algorithms for longest common subsequence variants

A Real Linear and Parallel Multiple Longest Common Subsequences (MLCS) Algorithm

Deposition and Extension Approach to Find Longest Common Subsequence for Thousands of Long Sequences

A Space-Saving Based MLCS Algorithm

Mining Multiple Longest Common Subsequence in Big Sequences

A Fast and Efficient Path Elimination Algorithm for Large-Scale Multiple Common Longest Sequence Problems

A Distributed Storage MLCS Algorithm with Time Efficient Upper Bound and Precise Lower Bound

Variants of Constrained Longest Common Subsequence

An Algorithm for the Constrained Longest Common Subsequence and Substring Problem