A Novel Fast and Memory Efficient Parallel MLCS Algorithm for Long and Large-Scale Sequences Alignments
Yanni Li, Yuping Wang, Zhensong Zhang, Yaxin Wang, Ding Ma, Jianbin Huang
DOI: https://doi.org/10.1109/ICDE.2016.7498322
2016-01-01
Abstract: Information can usually be abstracted as a character sequence over a finite alphabet. In the era of big data, sequences from many application fields (e.g., biological sequences) keep growing in both length and number. As a result, the classical NP-hard problem of finding the Multiple Longest Common Subsequences of multiple sequences (the MLCS problem, with many applications in bioinformatics, computational genomics, pattern recognition, etc.) has become a research hotspot and faces severe challenges. In this paper, we first show that the leading dominant-point-based MLCS algorithms are very hard to apply to alignments of long and large-scale sequences. To overcome their defects, we present a novel, efficient parallel MLCS algorithm based on a proposed problem-solving model and parallel topological sorting strategies. Comprehensive experiments on benchmark datasets of both random and biological sequences demonstrate that the time and space complexities of the proposed algorithm are only linearly related to the number of dominants in the aligned sequences, and that the proposed algorithm greatly outperforms the existing state-of-the-art dominant-point-based MLCS algorithms. It is therefore well suited to alignments of long and large-scale sequences.
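To make the problem statement concrete, the following is a minimal Python sketch of the MLCS problem itself. It is not the paper's algorithm: it is a sequential, memoized longest-path search over the graph of match points (positions where all sequences share a character), with no pruning of dominated points and no parallel topological sorting. The function name `mlcs` and the helper table `next_pos` are hypothetical names chosen for illustration.

```python
from functools import lru_cache


def mlcs(seqs):
    """Return one longest common subsequence of all strings in seqs.

    Illustrative only: exponential in the worst case and recursive,
    so suitable for short inputs, not long or large-scale sequences.
    """
    if not seqs:
        return ""
    # Only characters present in every sequence can appear in the result.
    alphabet = sorted(set.intersection(*(set(s) for s in seqs)))

    # next_pos[i][c][p] = smallest index q >= p with seqs[i][q] == c,
    # or None if c does not occur in seqs[i] at or after position p.
    next_pos = []
    for s in seqs:
        table = {}
        for c in alphabet:
            col = [None] * (len(s) + 1)
            for p in range(len(s) - 1, -1, -1):
                col[p] = p if s[p] == c else col[p + 1]
            table[c] = col
        next_pos.append(table)

    @lru_cache(maxsize=None)
    def best(point):
        # point[i] is the first index of seqs[i] that is still unused.
        result = ""
        for c in alphabet:
            succ = []
            for i, p in enumerate(point):
                q = next_pos[i][c][p]
                if q is None:
                    break  # c cannot be matched in every sequence
                succ.append(q + 1)
            else:
                candidate = c + best(tuple(succ))
                if len(candidate) > len(result):
                    result = candidate
        return result

    return best(tuple(0 for _ in seqs))


print(mlcs(["ABCBDAB", "BDCABA", "BCBA"]))  # prints BCBA
```

Each memoized state corresponds to one node of the match-point graph, so both time and memory in this sketch grow with the number of reachable match points. That growth is the kind of cost the abstract says makes existing approaches hard to apply to long and large-scale inputs.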