What problem does this paper attempt to address?

### The problems the paper attempts to solve

This paper aims to solve the problem of efficiently constructing a suffix array in a distributed-memory environment. Specifically, the authors attempt to develop a fast and memory-efficient distributed suffix-array construction algorithm to meet the demands of processing massive amounts of text data in the modern information age.

#### Problem background

With the development of information technology, the amount of text data to be processed is growing exponentially. For example:

- The English Wikipedia contains approximately 60 million pages and adds about 2.5 million pages annually.
- All public source code repositories created by over 100 million developers on GitHub require more than 21 terabytes of storage space.
- Genomic sequencing capabilities have also increased rapidly with technological progress.

These examples show that when analyzing large amounts of text information, it is crucial to scale the capabilities of algorithms, and the suffix array is the basis for many text-processing algorithms.

#### Existing challenges

Although the current state-of-the-art distributed suffix-array construction algorithms are fast, they require a large amount of working memory (usually 30 to 60 times the input size). In addition, these algorithms have a significant space-time trade-off: memory-efficient algorithms are usually slower. Therefore, the researchers pose the following question:

**Is it possible to implement a scalable, fast, and memory-efficient suffix-array construction algorithm in a distributed-memory environment?**

### Main contributions of the paper

To answer the above question, the authors make the following main contributions:

1. **A fast and memory-efficient distributed suffix-array construction algorithm**: This algorithm is based on the DCX algorithm and optimizes memory usage through techniques such as bucket sorting.
2. **A new random blocking scheme for load balancing**: This scheme not only improves the efficiency of the algorithm but also ensures load balancing among different computing nodes.
3. **Techniques applicable to other distributed computing models and algorithms**: The proposed method can be generalized to other distributed or external-memory models.

### Method overview

The algorithm proposed in the paper combines multiple optimization techniques in a distributed-memory environment, including but not limited to:

- **Bucket sorting**: By assigning elements to different buckets, the memory footprint during global sorting is reduced.
- **Random block redistribution**: By randomly redistributing blocks of the input text, the load on each processing unit is balanced.
- **Discarding and packing optimizations**: Discarding sorted unique-ranked suffixes and packing small character sets to save memory.

These techniques work together to make the new algorithm highly efficient while significantly reducing memory requirements, thus better adapting to the processing requirements of large-scale text data.
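To make the terms above concrete, here is a minimal single-machine sketch of what a suffix array is and of the idea behind packing a small character set. It is not the paper's distributed algorithm: the function names `naive_suffix_array` and `pack_chars` are hypothetical, and the quadratic-time sort is used only for clarity.

```python
def naive_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes in lexicographic order.

    Illustrative only: sorting full suffixes is far too slow and
    memory-hungry for large inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def pack_chars(text: str, alphabet: str) -> tuple[int, int]:
    """Pack `text` into one integer using the fewest bits per character.

    Returns (packed_value, bits_per_char). For equal-length strings,
    comparing packed values gives the same order as comparing the strings.
    """
    # Map each character to its rank in the sorted alphabet.
    rank = {c: r for r, c in enumerate(sorted(set(alphabet)))}
    bits = max(1, (len(rank) - 1).bit_length())
    packed = 0
    for c in text:
        packed = (packed << bits) | rank[c]
    return packed, bits


sa = naive_suffix_array("banana")
packed, bits = pack_chars("ACGT", "ACGT")
```

With a four-letter alphabet such as DNA, each character needs only 2 bits instead of 8, which is the kind of saving the packing optimization targets.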

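The random block redistribution idea can be sketched in the same spirit. The following is an assumption-laden illustration, not the scheme from the paper: the name `random_block_assignment`, the fixed block size, and the round-robin dealing of shuffled blocks are all hypothetical choices made for this example.

```python
import random


def random_block_assignment(n: int, block_size: int, num_pes: int, seed: int = 0):
    """Split text positions [0, n) into blocks and deal them out randomly.

    Returns one list of (start, end) half-open ranges per processing element.
    Shuffling before dealing means that a region of the text that is hard to
    sort is unlikely to end up on a single processing element.
    """
    blocks = [(s, min(s + block_size, n)) for s in range(0, n, block_size)]
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    rng.shuffle(blocks)
    assignment = [[] for _ in range(num_pes)]
    for i, block in enumerate(blocks):
        assignment[i % num_pes].append(block)  # round-robin over shuffled blocks
    return assignment


assignment = random_block_assignment(n=100, block_size=8, num_pes=4)
```

Each processing element receives nearly the same number of blocks, while the positions of those blocks in the text are spread out at random.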
Fast and Lightweight Distributed Suffix Array Construction -- First Results