DGST: Efficient and Scalable Suffix Tree Construction on Distributed Data-Parallel Platforms.

Guanghui Zhu, Chen Guo, Le Lu, Zhi Huang, Chunfeng Yuan, Rong Gu, Yihua Huang
DOI: https://doi.org/10.1016/j.parco.2019.06.002
IF: 0.983
2019-01-01
Parallel Computing
Abstract: The suffix tree is a fundamental data structure for string processing. It is widely used in many important scenarios such as text processing, information retrieval, and bioinformatics. With the rapid growth of data volume, constructing the suffix tree for large-scale datasets is very time-consuming. To solve this problem, a number of MPI-based parallel algorithms have been proposed, but they have limitations in fault tolerance and scalability for large-scale datasets. Recently, demand has grown for efficient algorithms that construct the suffix tree for large-scale datasets on distributed data-parallel platforms such as Hadoop and Spark. In this paper, we present DGST, an efficient and scalable algorithm for generalized suffix tree construction on distributed data-parallel platforms. DGST consists of two major stages: parallel sub-tree partitioning and parallel sub-tree construction. We first design a novel data partitioning strategy for both stages in the data-parallel paradigm. Then, we propose an efficient sub-tree partitioning algorithm based on parallel frequency counting. To improve load balance and amortize disk I/O costs, we propose an efficient task allocation strategy for sub-tree construction based on Bin-Packing and Number-Partitioning. At the sub-tree construction stage, we further propose a novel data structure, LCP-Range, and a multi-way LCP-Merge sorting algorithm for parallel LCP array construction. Experimental results on Apache Spark show that DGST outperforms the state-of-the-art ERa algorithm with an approximately 3-fold speedup on both the DNA and English text datasets. Furthermore, DGST achieves near-linear data and node scalability. © 2019 Elsevier B.V. All rights reserved.
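The abstract names two ideas that are easier to follow with a toy example: splitting the suffixes into sub-trees by counting prefix frequencies, and grouping those sub-trees into tasks by bin packing. The following single-machine Python sketch illustrates both under stated assumptions. It is not the DGST implementation: the function names, the `max_freq` threshold, the `$` terminator, and the use of the first-fit-decreasing heuristic are all hypothetical choices made for illustration, and the abstract does not specify the paper's actual rules.

```python
from collections import Counter


def partition_prefixes(strings, max_freq):
    """Split all suffixes into groups keyed by a shared prefix.

    Assumption (not from the paper): a prefix is accepted once it occurs
    at most max_freq times; more frequent prefixes are extended by one
    character and counted again. "$" is a hypothetical terminator.
    """
    texts = [s + "$" for s in strings]
    final = {}
    pending = {""}  # prefixes that are still too frequent
    k = 1           # current prefix length
    while pending:
        counts = Counter()
        for t in texts:
            for i in range(len(t)):
                p = t[i:i + k]
                # Count only full-length extensions of a pending prefix.
                if len(p) == k and p[:-1] in pending:
                    counts[p] += 1
        pending = set()
        for p, c in counts.items():
            # A prefix ending in the terminator cannot be extended further.
            if c <= max_freq or p.endswith("$"):
                final[p] = c
            else:
                pending.add(p)
        k += 1
    return final


def first_fit_decreasing(freqs, capacity):
    """Group prefixes into tasks whose total frequency fits in capacity.

    First-fit decreasing is a standard bin-packing heuristic, used here
    only as a stand-in for the paper's allocation strategy.
    """
    bins = []  # each entry is [total_frequency, [prefixes]]
    for p, c in sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0])):
        for b in bins:
            if b[0] + c <= capacity:
                b[0] += c
                b[1].append(p)
                break
        else:
            bins.append([c, [p]])
    return [b[1] for b in bins]


freqs = partition_prefixes(["banana"], max_freq=2)
tasks = first_fit_decreasing(freqs, capacity=3)
```

In a data-parallel setting, the counting loop would typically become a map step that emits each candidate prefix and a reduce step that sums the counts. Each resulting group of prefixes would then be handed to one worker, which builds the corresponding sub-trees.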