Optimizing General Matrix Multiplications on Modern Multi-core DSPs

Kainan Yu, Xinxin Qi, Peng Zhang, Jianbin Fang, Dezun Dong, Ruibo Wang, Tao Tang, Chun Huang, Yonggang Che, Zheng Wang
DOI: https://doi.org/10.1109/ipdps57955.2024.00090
2024-01-01
Abstract: General Matrix Multiplication (GEMM) is a key subprogram in high-performance computing (HPC) and deep learning workloads. As power and energy consumption become increasingly important in HPC systems, accelerators based on Digital Signal Processors (DSPs) have been integrated into general-purpose HPC systems. Because of architectural differences, GEMM optimization techniques developed for conventional multi-core CPUs and GPGPUs are not always applicable to DSPs. This paper shares our experience in optimizing GEMM on multi-core GPDSPs, using a CPU-DSP processor as a case study. Our approach combines several techniques tailored to DSP architectures: data partitioning, three-level pipelining, dedicated micro-kernel design, and improved vector reduction. These optimizations maximize the overlap between computation and communication while fully exploiting the floating-point arithmetic units. Our experimental results show that the optimized GEMM attains up to 96% of the theoretical peak performance of the hardware.
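To make two of the techniques named in the abstract concrete, the following is a minimal sketch of data partitioning combined with a small micro-kernel, written in portable C. It is not the authors' DSP implementation: the tile sizes (`TILE_M`, `TILE_N`, `TILE_K`), the function names, and the row-major layout are all assumptions chosen for illustration, and the sketch omits the pipelining, on-chip memory staging, and vector reduction that the paper describes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile sizes; on real hardware they would be chosen to fit
 * the on-chip memory and register file of the target core. */
enum { TILE_M = 8, TILE_N = 8, TILE_K = 16 };

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Micro-kernel: C[mb x nb] += A[mb x kb] * B[kb x nb].
 * All matrices are row-major; lda/ldb/ldc are the row strides. */
static void micro_kernel(size_t mb, size_t nb, size_t kb,
                         const float *A, size_t lda,
                         const float *B, size_t ldb,
                         float *C, size_t ldc)
{
    for (size_t i = 0; i < mb; ++i) {
        for (size_t p = 0; p < kb; ++p) {
            const float a = A[i * lda + p];
            /* The innermost loop walks a contiguous row of B and C,
             * the access pattern a vector unit would consume. */
            for (size_t j = 0; j < nb; ++j)
                C[i * ldc + j] += a * B[p * ldb + j];
        }
    }
}

/* Blocked GEMM: C (M x N) += A (M x K) * B (K x N), row-major.
 * The three outer loops partition the problem into tiles and hand
 * each tile to the micro-kernel; edge tiles may be smaller. */
void gemm_blocked(size_t M, size_t N, size_t K,
                  const float *A, const float *B, float *C)
{
    assert(A && B && C);
    for (size_t i0 = 0; i0 < M; i0 += TILE_M)
        for (size_t p0 = 0; p0 < K; p0 += TILE_K)
            for (size_t j0 = 0; j0 < N; j0 += TILE_N)
                micro_kernel(min_sz(TILE_M, M - i0),
                             min_sz(TILE_N, N - j0),
                             min_sz(TILE_K, K - p0),
                             &A[i0 * K + p0], K,
                             &B[p0 * N + j0], N,
                             &C[i0 * N + j0], N);
}

/* Naive triple loop, used only as a reference. */
void gemm_naive(size_t M, size_t N, size_t K,
                const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < M; ++i)
        for (size_t j = 0; j < N; ++j)
            for (size_t p = 0; p < K; ++p)
                C[i * N + j] += A[i * K + p] * B[p * N + j];
}

/* Returns 1 if the blocked and naive versions agree on a small problem
 * whose sizes are deliberately not multiples of the tile sizes.
 * Inputs are small integers, so float results are exact. */
int gemm_self_check(void)
{
    enum { M = 13, N = 11, K = 37 };
    static float A[M * K], B[K * N], C1[M * N], C2[M * N];
    for (size_t i = 0; i < (size_t)M * K; ++i) A[i] = (float)((int)(i % 7) - 3);
    for (size_t i = 0; i < (size_t)K * N; ++i) B[i] = (float)((int)(i % 5) - 2);
    for (size_t i = 0; i < (size_t)M * N; ++i) { C1[i] = 0.0f; C2[i] = 0.0f; }
    gemm_blocked(M, N, K, A, B, C1);
    gemm_naive(M, N, K, A, B, C2);
    for (size_t i = 0; i < (size_t)M * N; ++i)
        if (C1[i] != C2[i]) return 0;
    return 1;
}
```

Calling `gemm_self_check()` compares the tiled routine against the naive loop. In an actual DSP port, the tile copies would typically be moved into on-chip memory by DMA while the previous tile is being computed, which is where the overlap between computation and communication mentioned in the abstract would come from.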