Fast and Lightweight Distributed Suffix Array Construction -- First Results

Manuel Haag, Florian Kurpicz, Peter Sanders, Matthias Schimek
2024-12-13
Abstract: We present first algorithmic ideas for a practical and lightweight adaption of the DCX suffix array construction algorithm [Sanders et al., 2003] to the distributed-memory setting. Our approach relies on a bucketing technique which enables a lightweight implementation which uses less than half of the memory required by the currently fastest distributed-memory suffix array algorithm PSAC [Flick and Aluru, 2015] while being competitive or even faster in terms of running time.
Data Structures and Algorithms
What problem does this paper attempt to address?
### The problems the paper attempts to solve

This paper addresses the problem of efficiently constructing a suffix array in a distributed-memory environment. Specifically, the authors aim to develop a distributed suffix array construction algorithm that is both fast and memory-efficient, so that it can cope with the massive amounts of text data that now need to be processed.

#### Problem background

The amount of text data to be processed is growing rapidly. For example:

- The English Wikipedia contains approximately 60 million pages and adds about 2.5 million pages annually.
- All public source code repositories created by over 100 million developers on GitHub require more than 21 terabytes of storage space.
- Genomic sequencing capabilities have also increased rapidly with technological progress.

These examples show that algorithms for analyzing large amounts of text must scale, and the suffix array is the basis for many text-processing algorithms.

#### Existing challenges

Although the current state-of-the-art distributed suffix array construction algorithms are fast, they require a large amount of working memory (usually 30 to 60 times the input size). In addition, existing algorithms exhibit a significant space-time trade-off: the memory-efficient ones are usually slower. The researchers therefore pose the following question:

**Is it possible to implement a scalable, fast, and memory-efficient suffix array construction algorithm in a distributed-memory environment?**

### Main contributions of the paper

To answer the above question, the authors make the following main contributions:

1. **A fast and memory-efficient distributed suffix array construction algorithm**: The algorithm is based on the DCX algorithm and reduces memory usage through techniques such as bucketing.
2. **A new random blocking scheme for load balancing**: The scheme balances the load among the computing nodes, which in turn improves the efficiency of the algorithm.
3. **Techniques applicable to other distributed computing models and algorithms**: The proposed methods can be generalized to other distributed or external-memory models.

### Method overview

The algorithm proposed in the paper combines multiple optimization techniques in a distributed-memory environment, including but not limited to:

- **Bucketing**: Elements are assigned to buckets that are processed one after another, which reduces the memory footprint of global sorting.
- **Random block redistribution**: Blocks of the input text are redistributed randomly, which balances the load across the processing units.
- **Discarding and packing optimizations**: Suffixes that are already sorted with a unique rank are discarded, and characters from small alphabets are packed to save memory.

Together, these techniques make the new algorithm fast while significantly reducing its memory requirements, which makes it better suited to large-scale text data.
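
The bucketing idea from the method overview can be illustrated with a small single-process sketch. This is not the authors' distributed implementation and it does not implement DCX: it sorts suffixes naively by comparing text slices, and the function names `naive_suffix_array` and `bucketed_suffix_array` as well as the `prefix_len` parameter are hypothetical names chosen for this example. It only shows why bucketing can lower peak memory: the sort keys are materialized for one bucket at a time instead of for all suffixes at once.

```python
from collections import defaultdict


def naive_suffix_array(text):
    """Reference construction: sort all start positions by their suffix.

    Python's sort computes every key up front, so this holds a copy of
    all len(text) suffixes in memory at the same time.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def bucketed_suffix_array(text, prefix_len=1):
    """Illustrative bucketed construction (assumption: not the paper's code).

    Suffixes are first partitioned by their leading `prefix_len` characters.
    The buckets are then sorted one after another, so suffix copies are
    only materialized for the bucket that is currently being sorted.
    """
    buckets = defaultdict(list)
    for i in range(len(text)):
        # Suffixes near the end of the text may have a shorter prefix.
        buckets[text[i:i + prefix_len]].append(i)

    result = []
    # Visiting the prefixes in lexicographic order keeps the global order
    # correct: every suffix in an earlier bucket is smaller than every
    # suffix in a later bucket.
    for prefix in sorted(buckets):
        bucket = buckets[prefix]
        bucket.sort(key=lambda i: text[i:])
        result.extend(bucket)
    return result
```

For example, `bucketed_suffix_array("banana")` returns `[5, 3, 1, 0, 4, 2]`, the same result as `naive_suffix_array("banana")`. In a distributed setting the same principle would apply to the elements exchanged and sorted across processing units, but how the buckets are chosen and communicated there is specific to the paper and is not modeled here.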