Optimizing Stencil Computation on Multi-core DSPs

Fugeng Zhu, Jianbin Fang, Kainan Yu, Xinxin Qi, Tao Tang, Jing Xie, Jie Ren, Peng Zhang, Yonggang Che, Chun Huang
DOI: https://doi.org/10.1145/3673038.3673062
2024-01-01
Abstract: Stencil is a common computation pattern in high-performance computing (HPC) applications. While extensive work has been done to optimize stencil kernels on CPUs and GPUs, there is no consensus on how best to optimize stencils on multi-core Digital Signal Processors (DSPs) used in emerging HPC systems. This paper shares our experience in optimizing stencil kernels on multi-core DSPs. Our approach combines coarse- and fine-grained parallel optimization techniques to enhance the performance of stencil computations. Our optimizations include a vectorization-enabled micro-kernel to utilize instruction parallelism, a memory-aware data reuse strategy to maximize data locality across multiple memory levels, and a triple-buffering mechanism to overlap computation and memory communications. Experimental results show that our approach can effectively utilize the memory bandwidth and the computation capability of the underlying hardware. Our integrated optimizations can yield a 3.72x speedup over the 16-core CPU counterpart.
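The triple-buffering idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the 1D 3-point stencil, the function names `stencil_reference` and `stencil_triple_buffered`, the `tile` parameter, and the weights are all assumptions chosen for illustration. On a real DSP the load and store stages would presumably be asynchronous DMA transfers running concurrently with the compute stage; here the three stages run one after another so that only the buffer rotation is shown.

```python
def stencil_reference(src, w):
    """3-point stencil on interior points; boundary values are copied unchanged."""
    out = list(src)
    for i in range(1, len(src) - 1):
        out[i] = w[0] * src[i - 1] + w[1] * src[i] + w[2] * src[i + 1]
    return out


def stencil_triple_buffered(src, w, tile=4):
    """Same stencil, processed tile by tile with three rotating buffers.

    At step s, buffer s % 3 receives tile s (load), buffer (s - 1) % 3 is
    computed on (tile s - 1), and buffer (s - 2) % 3 is written back
    (tile s - 2). The stages are executed sequentially in this sketch.
    """
    n = len(src)
    out = list(src)
    starts = list(range(1, n - 1, tile))  # first interior index of each tile
    bufs = [None, None, None]             # each slot holds (start, data)
    for s in range(len(starts) + 2):      # two extra steps drain the pipeline
        load, comp, store = s % 3, (s - 1) % 3, (s - 2) % 3
        # Store stage: write back the finished results of tile s - 2.
        if s >= 2:
            start, res = bufs[store]
            out[start:start + len(res)] = res
        # Compute stage: apply the stencil to tile s - 1 (data includes halo).
        if 1 <= s <= len(starts):
            start, halo = bufs[comp]
            res = [w[0] * halo[j - 1] + w[1] * halo[j] + w[2] * halo[j + 1]
                   for j in range(1, len(halo) - 1)]
            bufs[comp] = (start, res)
        # Load stage: fetch tile s plus one halo element on each side.
        if s < len(starts):
            start = starts[s]
            end = min(start + tile, n - 1)
            bufs[load] = (start, src[start - 1:end + 1])
    return out
```

Because each buffer is reused only after its previous tile has been stored, three buffers are enough to keep a load, a compute, and a store in flight at once, which is what lets computation overlap with memory traffic when the stages are truly asynchronous.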