A One-for-All and <i>O</i>(<i>V</i> log(<i>V</i>))-Cost Solution for Parallel Merge Style Operations on Sorted Key-Value Arrays

Bangyan Wang, Lei Deng, Fei Sun, Guohao Dai, Liu Liu, Yu Wang, Yuan Xie
DOI: https://doi.org/10.1145/3503222.3507728
2022-01-01
Abstract: The processing of sorted key-value arrays using a "merge style operation (MSO)" is a basic and important problem in domains such as scientific computing, deep learning, databases, graph analysis, sorting, and set operations. MSOs dominate the execution time in some important applications like SpGEMM and graph mining. For example, sparse vector addition as an MSO takes up to 98% of the execution time in SpGEMM in our experiment. For this reason, accelerating MSOs on CPUs, GPUs, and accelerators using parallel execution has been extensively studied, but the solutions in prior work have three major limitations. (1) They treat different MSOs as isolated problems using incompatible methods, and a unified solution is still lacking. (2) They do not have the flexibility to support variable key/value sizes and value calculations at runtime given a fixed hardware design. (3) They require a quadratic hardware cost (O(V^2)) for a given parallelism V in most cases. To address these three limitations, we make the following efforts. (1) We present a one-for-all solution that supports all MSOs of interest, based on a unified abstraction model, the "restricted zip machine (RZM)". (2) We propose a set of composable and parallel primitives for RZM to provide the flexibility to support variable key/value sizes and value calculations. (3) We provide a hardware design that implements the proposed primitives using only O(V log(V)) resources. With the above techniques, a flexible and efficient solution for MSOs has been built. Our design can be used either as a drop-in replacement for the merge unit in prior accelerators to reduce the cost from O(V^2) to O(V log(V)), or as an extension to the SIMD ISA of CPUs and GPUs.
In our evaluation on CPU, when V = 16 (512-bit SIMD, 32-bit element), we achieve significant speedup on a range of representative kernels including set operations (8.4x), database joins (7.3x), sparse vector/matrix/tensor addition/multiplication on real/complex numbers (6.5x), merge sort (8.0x over scalar, 3.4x over the state-of-the-art SIMD), and SpGEMM (4.4x over the best one in the baseline collection).
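To make the term "merge style operation" concrete, the following is a minimal scalar sketch of sparse vector addition on two sorted key-value arrays, the MSO the abstract cites as dominating SpGEMM. It is only an illustration of the sequential two-pointer baseline, not the paper's RZM model, its primitives, or its hardware design; the function name `sparse_vector_add` and the list-based representation are assumptions made for this example.

```python
def sparse_vector_add(keys_a, vals_a, keys_b, vals_b):
    """Add two sparse vectors stored as sorted (key, value) arrays.

    Hypothetical helper for illustration: keys are assumed to be
    strictly increasing within each input.
    """
    i = j = 0
    out_keys, out_vals = [], []
    # Walk both arrays in key order, as in a classic two-way merge.
    while i < len(keys_a) and j < len(keys_b):
        if keys_a[i] < keys_b[j]:
            out_keys.append(keys_a[i])
            out_vals.append(vals_a[i])
            i += 1
        elif keys_a[i] > keys_b[j]:
            out_keys.append(keys_b[j])
            out_vals.append(vals_b[j])
            j += 1
        else:
            # Matching keys: this is the "value calculation" step.
            out_keys.append(keys_a[i])
            out_vals.append(vals_a[i] + vals_b[j])
            i += 1
            j += 1
    # At most one of the inputs still has elements; copy them over.
    out_keys.extend(keys_a[i:])
    out_vals.extend(vals_a[i:])
    out_keys.extend(keys_b[j:])
    out_vals.extend(vals_b[j:])
    return out_keys, out_vals


keys, vals = sparse_vector_add([1, 3, 5], [1.0, 2.0, 3.0], [3, 4], [10.0, 20.0])
print(keys, vals)
```

Each loop iteration depends on the outcome of the previous key comparison, which is what makes this pattern hard to vectorize directly. Other MSOs such as set intersection or a merge join follow the same loop shape and differ mainly in which branches emit output and how matching values are combined.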