Abstract

BACKGROUND: Multiple sequence alignment (MSA) is a classic and powerful technique for sequence analysis in bioinformatics. With the rapid growth of biological datasets, parallelizing MSA has become necessary to keep its running time at an acceptable level. Although there is a large body of work on MSA, existing approaches are either insufficient or rest on implicit assumptions that limit their generality. First, the properties of users' sequences, including the size of the dataset and the lengths of the sequences, can take arbitrary values and are generally unknown before submission, a fact that previous work ignores. Second, the center star strategy is well suited to aligning similar sequences, but its first stage, center sequence selection, is highly time-consuming and requires further optimization. Moreover, on heterogeneous CPU/GPU platforms, prior studies parallelize MSA on GPU devices only, leaving the CPUs idle during the computation. Co-run computation, by contrast, can maximize the utilization of the computing resources by processing the workload on the CPU and the GPU simultaneously.

RESULTS: This paper presents CMSA, a robust and efficient MSA system for large-scale datasets on heterogeneous CPU/GPU platforms. It performs and optimizes multiple sequence alignment automatically for users' submitted sequences without making any assumptions about them. CMSA adopts the co-run computation model so that both CPU and GPU devices are fully utilized. Moreover, CMSA proposes an improved center star strategy that reduces the time complexity of center sequence selection from O(mn^2) to O(mn). The experimental results show that CMSA achieves up to an 11× speedup and outperforms state-of-the-art software.

CONCLUSION: CMSA focuses on the alignment of multiple similar RNA/DNA sequences and proposes a novel bitmap-based algorithm to improve the center star strategy. We conclude that harnessing the high performance of modern GPUs is a promising approach to accelerating multiple sequence alignment. In addition, adopting the co-run computation model can significantly increase the utilization of the entire system. The source code is available at https://github.com/wangvsa/CMSA .
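To make the center sequence selection stage of the center star strategy concrete, the following is a minimal Python sketch of the textbook baseline: compute a distance for every pair of sequences and pick the sequence whose total distance to all others is smallest. It is not CMSA's bitmap-based algorithm, whose details the abstract does not give, and the complexity figures quoted in the abstract refer to the authors' methods rather than to this sketch. The function names `edit_distance` and `select_center` and the use of Levenshtein distance as the pairwise measure are assumptions made for illustration.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def select_center(seqs: list[str]) -> int:
    """Return the index of the sequence with the smallest summed distance.

    Baseline only (assumption): every pair of sequences is compared once,
    so the work grows quadratically with the number of sequences.
    """
    totals = [0] * len(seqs)
    for i, j in combinations(range(len(seqs)), 2):
        d = edit_distance(seqs[i], seqs[j])
        totals[i] += d
        totals[j] += d
    return min(range(len(seqs)), key=totals.__getitem__)


if __name__ == "__main__":
    sequences = ["ACGT", "ACGA", "TCGA", "ACGG"]  # toy input, not real data
    print(select_center(sequences))
```

The all-pairs loop is what makes this stage expensive on large datasets, which is the cost that the improved strategy described in the abstract is meant to reduce.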

Sample-Align-D: A High Performance Multiple Sequence Alignment System using Phylogenetic Sampling and Domain Decomposition

A Domain Decomposition Strategy for Alignment of Multiple Biological Sequences on Multiprocessor Platforms

PoMSA: An Efficient and Precise Position-based Multiple Sequence Alignment Technique

An efficient parallel algorithm for multiple sequence similarities calculation using a low complexity method.

SaAlign: Multiple DNA/RNA Sequence Alignment and Phylogenetic Tree Construction Tool for Ultra-Large Datasets and Ultra-Long Sequences Based on Suffix Array

A Survey of Multiple Sequence Alignment Techniques.

HAlign-II: Efficient Ultra-Large Multiple Sequence Alignment and Phylogenetic Tree Reconstruction with Distributed and Parallel Computing

Multiple Sequence Alignment Based On A Suffix Tree And Center-Star Strategy: A Linear Method For Multiple Nucleotide Sequence Alignment On Spark Parallel Framework

CMSA: a Heterogeneous CPU/GPU Computing System for Multiple Similar RNA/DNA Sequence Alignment

A novel fast multiple nucleotide sequence alignment method based on FM-index

A Knowledge-Based Multiple-Sequence Alignment Algorithm

WMSA: a Novel Method for Multiple Sequence Alignment of DNA Sequences.

Multiple Sequence Alignment and Reconstructing Phylogenetic Trees with Hadoop.

FMAlign2: a Novel Fast Multiple Nucleotide Sequence Alignment Method for Ultralong Datasets.

Cluster-Distribute-Align-Merge: A General Algorithm to Speed Up Multiple Sequence Alignment on Multi-Core Computers

Kalign – an accurate and fast multiple sequence alignment algorithm

HAlign: Fast Multiple Similar DNA/RNA Sequence Alignment Based on the Centre Star Strategy

Efficient Bio-molecules Sequencing Using Multi-Objective Optimization and High-Performance Computing

MLProbs: A Data-Centric Pipeline for Better Multiple Sequence Alignment.

Pyro-Align: Sample-Align based Multiple Alignment system for Pyrosequencing Reads of Large Number

A Data Parallel Strategy for Aligning Multiple Biological Sequences on Homogeneous Multiprocessor Platform