SLPal: Accelerating Long Sequence Alignment on Many-Core and Multi-Core Architectures
Xiaoming Xu,Yuandong Chan,Kai Xu,Jikai Zhang,Xiaoning Wang,Zekun Yin,Weiguo Liu
DOI: https://doi.org/10.1109/BIBM49941.2020.9313429
2020-01-01
Abstract: Biological sequence alignment is a fundamental application in bioinformatics. It can be used to identify functionally conserved sequences and find evolutionary relationships between species. To compare entire genomes from different species, biologists increasingly need alignment methods that are efficient enough to handle long sequences and accurate enough to correctly align the conserved biological features between distant species. Global alignments are important because they reveal the shared order of biological features in the compared species and produce a more accurate alignment at the base-pair level when the features are in the same order. The best-known global alignment algorithm is Needleman-Wunsch. More recently, BitPAl, a bit-parallel algorithm for global alignment with general integer scoring, provided a new implementation of the Needleman-Wunsch algorithm (BitNW). Compared with the original Needleman-Wunsch algorithm, BitNW is significantly faster because it exploits bit parallelism. A number of parallel strategies have been proposed to accelerate exact alignment methods. However, most of them fail to align long biological sequences because of the quadratic time complexity. In this paper, we propose SLPal, a fast bit-parallel algorithm for accelerating long DNA sequence comparison on Intel many-core and multi-core architectures. To fully exploit the computing power of many cores and the 512-bit vector processing units (VPUs), we use a two-level parallelism scheme: a coarse-grained thread level and a fine-grained VPU level. At the thread level, the alignment scoring matrix is split into small tiles, and multiple threads process these tiles concurrently using the Intel TBB library. At the VPU level, the computing kernels are implemented with Single Instruction Multiple Data (SIMD) instructions, so that 16 independent integers residing in a 512-bit vector register can be processed simultaneously.
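The tile-based thread-level scheme can be illustrated with a minimal sketch. It is not the SLPal implementation: it uses the classic scalar Needleman-Wunsch recurrence rather than the bit-parallel BitNW kernel, runs the tiles sequentially instead of dispatching them through Intel TBB, and the scoring values, the tile size, and the names `nw_score_tiled` and `fill_tile` are assumptions chosen for illustration. What it shows is the dependency structure: tiles on the same anti-diagonal depend only on earlier anti-diagonals, so they could be processed concurrently.

```python
from typing import List

# Illustrative scores only; these values are not taken from the paper.
MATCH, MISMATCH, GAP = 2, -3, -5


def nw_score_tiled(a: str, b: str, tile: int = 4) -> int:
    """Global alignment score, computed tile by tile in wavefront order."""
    n, m = len(a), len(b)
    # The full matrix is kept for clarity; row 0 and column 0 hold gap costs.
    H: List[List[int]] = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        H[i][0] = i * GAP
    for j in range(1, m + 1):
        H[0][j] = j * GAP

    tiles_i = (n + tile - 1) // tile  # number of tile rows
    tiles_j = (m + tile - 1) // tile  # number of tile columns

    def fill_tile(ti: int, tj: int) -> None:
        # Scalar cell updates; a SIMD or bit-parallel kernel would go here.
        for i in range(ti * tile + 1, min((ti + 1) * tile, n) + 1):
            for j in range(tj * tile + 1, min((tj + 1) * tile, m) + 1):
                sub = MATCH if a[i - 1] == b[j - 1] else MISMATCH
                H[i][j] = max(H[i - 1][j - 1] + sub,
                              H[i - 1][j] + GAP,
                              H[i][j - 1] + GAP)

    # Tiles with equal ti + tj form one wave and are mutually independent,
    # so each wave could be handed to a pool of worker threads.
    for wave in range(tiles_i + tiles_j - 1):
        first = max(0, wave - tiles_j + 1)
        last = min(wave, tiles_i - 1)
        for ti in range(first, last + 1):
            fill_tile(ti, wave - ti)
    return H[n][m]
```

Because the tile size only changes the order in which cells are visited, calling `nw_score_tiled("GATTACA", "GCATGCU", tile=2)` and `nw_score_tiled("GATTACA", "GCATGCU", tile=100)` yields the same score.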
The evaluation shows that our algorithm achieves stable performance on all benchmark data, reaching up to 511.7 GCUPS on a server with a single Xeon Phi 7210 processor and up to 617.2 GCUPS on a server with dual Xeon Gold 6148 20-core processors. Furthermore, our tests show that SLPal can align two sequences of about 5 million bp in 50 seconds on the server equipped with dual Xeon Gold 6148 CPUs.
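As a rough consistency check on these figures, the sketch below computes throughput in GCUPS, taken here in its usual sense of billions of dynamic-programming cell updates per second. It assumes that each of the two sequences is about 5 million bp long, which is one reading of the statement above; the function name `gcups` is hypothetical.

```python
def gcups(len_a: int, len_b: int, seconds: float) -> float:
    """Giga cell updates per second for one pairwise alignment.

    A full alignment matrix has len_a * len_b cells, each updated once.
    """
    return len_a * len_b / seconds / 1e9


# Assumed inputs: two sequences of 5 million bp each, aligned in 50 seconds.
print(gcups(5_000_000, 5_000_000, 50.0))  # 500.0
```

Under that assumption the long-sequence run corresponds to about 500 GCUPS, which is in line with the reported peak of 617.2 GCUPS on the dual Xeon Gold 6148 server.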