Abstract: Finding the multiple longest common subsequence (MLCS) of a set of sequences is a computationally intensive and challenging problem with significant applications in fields such as text comparison, pattern recognition, and gene diagnosis. Dominant point-based MLCS algorithms are currently the most popular and extensively studied approach. In general, they construct a directed acyclic graph (DAG) of matching points and convert the MLCS problem into a search for the longest paths in that DAG. Several improvements have focused on decreasing model size and reducing redundant computation, including 1) hash methods for eliminating duplicate nodes, 2) dynamic structures that support a smaller DAG, and 3) path pruning strategies. However, these algorithms remain limited on large-scale MLCS problems because 1) the dynamic structures are too time-consuming to maintain and 2) path pruning relies heavily on the tightness of the lower and upper bounds of the MLCS. As a result, the large-scale MLCS problem remains a challenge. We propose a novel algorithm for the large-scale MLCS problem, named dwMLCS, which is based on two models. The first is a dynamic DAG model that is both space and time efficient and significantly decreases the size of the DAG. The second is a weighted DAG model with new successor strategies, on which we design an algorithm for finding a tighter lower bound of the MLCS. Path pruning is then conducted to further reduce the size of the DAG and eliminate redundant computation. Additionally, we propose an upper bound method that improves the efficiency of the path pruning strategy. Experimental results demonstrate that the proposed models and algorithms are more effective and efficient than state-of-the-art algorithms. The source code of dwMLCS can be downloaded from https://github.com/BioLab310/dwMLCS.
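
To make the general idea behind point-based MLCS algorithms concrete, the following is a minimal sketch of the baseline formulation described above: matching points are tuples of positions (one per sequence), each point's successors are found through per-character successor tables, and the MLCS is a longest path from a virtual source point. This is not the dwMLCS algorithm: it has no dominance filtering, no dynamic or weighted DAG, and no bound-based pruning. The function names `build_successor_tables` and `mlcs` are hypothetical, and the memoization cache merely stands in for the hash-based elimination of duplicate nodes.

```python
from functools import lru_cache


def build_successor_tables(seqs, alphabet):
    """For each sequence, tables[i][c][p] is the smallest 1-based index q > p
    with seqs[i][q - 1] == c, or 0 if no such index exists."""
    tables = []
    for s in seqs:
        n = len(s)
        table = {c: [0] * (n + 1) for c in alphabet}
        for c in alphabet:
            nxt = 0
            # Scan right to left so nxt always holds the nearest later match.
            for p in range(n - 1, -1, -1):
                if s[p] == c:
                    nxt = p + 1
                table[c][p] = nxt
        tables.append(table)
    return tables


def mlcs(seqs):
    """Return one longest common subsequence of all sequences in seqs.

    Illustrative baseline only (assumes at least one sequence); exponential
    in the worst case and recursive, so suitable for small inputs.
    """
    # Only characters present in every sequence can appear in a common subsequence.
    alphabet = sorted(set.intersection(*(set(s) for s in seqs)))
    tables = build_successor_tables(seqs, alphabet)

    @lru_cache(maxsize=None)  # each matching point is expanded only once
    def longest_from(point):
        best = ""
        for c in alphabet:
            # Successor point: next occurrence of c in every sequence.
            succ = tuple(tables[i][c][p] for i, p in enumerate(point))
            if 0 in succ:
                continue  # c no longer occurs in at least one sequence
            candidate = c + longest_from(succ)
            if len(candidate) > len(best):
                best = candidate
        return best

    # The virtual source point (0, ..., 0) precedes every real matching point.
    return longest_from((0,) * len(seqs))


if __name__ == "__main__":
    print(mlcs(["ACGT", "AGCT", "ACT"]))
```

Even in this small form, the number of distinct matching points can grow quickly with the number and length of the sequences, which is the cost that the smaller DAG models and path pruning described in the abstract aim to reduce.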

dwMLCS: An Efficient MLCS Algorithm based on Dynamic and Weighted Directed Acyclic Graph