Exploiting Hierarchical Parallelism and Reusability in Tensor Kernel Processing on Heterogeneous HPC Systems

Yuedan Chen, Guoqing Xiao, M. Tamer Ozsu, Zhuo Tang, Albert Y. Zomaya, Kenli Li
DOI: https://doi.org/10.1109/icde53745.2022.00234
2022-01-01
Abstract: Canonical Polyadic Decomposition (CPD) of sparse tensors is an effective tool in various machine learning and data analytics applications, in which the sparse Matricized Tensor Times Khatri-Rao Product (MTTKRP) is the major performance bottleneck. To overcome this bottleneck and support efficient applications, this paper presents HPSpTM, an efficient sparse MTTKRP framework that exploits multi-level parallelism and data reusability on heterogeneous HPC systems. HPSpTM incorporates: (1) a multi-level matrix-driven tiling engine that leverages the process- and thread-level parallelism of the underlying platform, as well as data reusability, based on the derived factor matrix-driven MTTKRP algorithm; (2) a tensor-driven parallel execution scheme that enables buffering-aware scheduling and pipeline scheduling to optimize performance at tile granularity; (3) a partition-aware lightweight data storage format that achieves better data locality based on the proposed hierarchical and fine-grained execution; and (4) a performance auto-tuning technique that flexibly auto-adjusts tile sizes across various input datasets based on a designed runtime model. Our experiments show that HPSpTM on an Nvidia Tesla P100 achieves an average performance improvement of up to 76.46% over state-of-the-art approaches, and a speedup of up to 15.39× when scaling from 8 to 128 core groups (each corresponding to a process) on the Sunway TaihuLight supercomputer.
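For orientation, the MTTKRP operation named in the abstract can be sketched for a 3-way sparse tensor in coordinate (COO) form. This is a plain NumPy reference only: the function name `sparse_mttkrp_mode0` and its parameters are hypothetical, and the sketch does not reflect HPSpTM's tiling, scheduling, storage format, or auto-tuning.

```python
import numpy as np

def sparse_mttkrp_mode0(indices, values, B, C, num_rows):
    """Reference mode-0 MTTKRP for a 3-way sparse tensor in COO form.

    indices:  (nnz, 3) integer array of (i, j, k) coordinates
    values:   (nnz,) array of nonzero values x_ijk
    B, C:     factor matrices of shape (J, R) and (K, R)
    num_rows: size I of the first tensor mode

    Returns M of shape (I, R) with
    M[i, :] = sum over nonzeros (i, j, k) of x_ijk * B[j, :] * C[k, :].
    """
    i, j, k = indices[:, 0], indices[:, 1], indices[:, 2]
    # One length-R contribution per nonzero: x_ijk * (B[j, :] * C[k, :]).
    contrib = values[:, None] * B[j] * C[k]
    M = np.zeros((num_rows, B.shape[1]), dtype=contrib.dtype)
    # Unbuffered scatter-add so repeated row indices accumulate correctly.
    np.add.at(M, i, contrib)
    return M

# Illustrative usage with made-up data (not from the paper).
rng = np.random.default_rng(0)
indices = np.array([[0, 1, 2], [1, 0, 0], [2, 2, 1]])
values = np.array([1.0, 2.0, -1.0])
B = rng.standard_normal((3, 4))
C = rng.standard_normal((3, 4))
M = sparse_mttkrp_mode0(indices, values, B, C, num_rows=3)
```

Each nonzero touches one row of each factor matrix, so the rows of `B` and `C` are re-read whenever coordinates repeat. That repeated access is the kind of data reuse the abstract refers to when it mentions reusability.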