A Novel Parallel Algorithm for Sparse Tensor Matrix Chain Multiplication via TCU-Acceleration

Haotian Wang, Wangdong Yang, Rong Hu, Renqiu Ouyang, Kenli Li, Keqin Li
DOI: https://doi.org/10.1109/tpds.2023.3288520
IF: 5.3
2023-07-04
IEEE Transactions on Parallel and Distributed Systems
Abstract: Analysis of multi-dimensional data, especially tensor decomposition, which extracts latent information, is becoming considerably popular. Although multi-dimensional sparse data is typically processed on multi-core processors, developing highly optimized GPU-based Sparse Tensor Matrix Chain Multiplication (SpTMCM) is challenging. The purpose of this paper is to investigate a novel approach named SpTMCM and to explore the discovery of SpTMCM coupled with the emerging computing core, Tensor Core Unit (TCU). In contrast to prior work, the proposed novel approach enables a uniform storage format and optimization approach for SpTMCM. We design a hybrid tensor format based on multi-dimensional tiling that divides the tensor depending on the tile threshold to address the inefficient memory accesses caused by the irregular nonzero distribution of the sparse tensor. Further, we develop a TCU-based tensor parallel algorithm with our novel approach to increase the memory bandwidth. Compared to state-of-the-art works, our method achieves 1.16∼24.12× speedup for SpMTTKRP and 5.07∼7.15× speedup for SpTTMChain on an NVIDIA A100 GPU on a range of real-world sparse tensors.
Computer Science, Theory & Methods; Engineering, Electrical & Electronic
What problem does this paper attempt to address?