Performance Analysis and Optimizations of Matrix Multiplications on ARMv8 Processors

Hucheng Liu, Shaohuai Shi, Xuan Wang, Zoe L. Jiang, Qian Chen
DOI: https://doi.org/10.23919/date58400.2024.10546786
2024-01-01
Abstract: General matrix multiplication (GEMM) is a fundamental subroutine widely used in applications such as scientific computing and machine learning. Although many studies are dedicated to optimizing its performance, they mainly focus on regularly shaped matrices or on x86 platforms. GEMM on irregularly shaped matrices running on modern ARMv8 processors remains under-explored. In this paper, we provide a thorough performance analysis of the general block-panel multiplication (GEBP) kernel of GEMM for irregularly shaped matrices. Based on our analysis, we propose a new GEMM algorithm named EPPA with three novel schemes to improve GEMM performance on ARMv8 processors: i) eliminating packing to reduce L1 cache contention, ii) avoiding data eviction and pre-fetching data to reduce the L1 cache miss penalty, and iii) an adaptive strategy that selects among the above two schemes and the original scheme. We conduct extensive experiments with a wide range of irregular matrices on three popular ARMv8 processors and compare against seven state-of-the-art GEMM libraries. The experimental results show that our EPPA algorithm outperforms existing ones across workloads and processors and accelerates real-world applications.
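To make the term "packing" concrete, the following is a minimal illustrative sketch in pure Python, not the EPPA algorithm and not code from the paper. It assumes row-major matrices stored as flat lists, and the names `pack_block` and `gemm_blocked` and the block sizes `mc` and `kc` are hypothetical. The blocked loop either copies each block of A into a contiguous buffer before multiplying (packing) or reads A in place (no packing); both paths compute the same product.

```python
def pack_block(a, lda, row0, col0, mc, kc):
    """Copy an mc x kc block of row-major A into a contiguous buffer."""
    buf = [0.0] * (mc * kc)
    for i in range(mc):
        for p in range(kc):
            buf[i * kc + p] = a[(row0 + i) * lda + (col0 + p)]
    return buf


def gemm_blocked(a, b, m, n, k, mc=2, kc=2, pack=True):
    """Return C = A * B (row-major flat lists), blocked over rows and depth.

    pack=True  copies each A block into a contiguous buffer first.
    pack=False reads the A block in place with its original stride.
    """
    c = [0.0] * (m * n)
    for p0 in range(0, k, kc):          # depth blocks
        kb = min(kc, k - p0)
        for i0 in range(0, m, mc):      # row blocks of A
            mb = min(mc, m - i0)
            if pack:
                blk, ld, r0, c0 = pack_block(a, k, i0, p0, mb, kb), kb, 0, 0
            else:
                blk, ld, r0, c0 = a, k, i0, p0
            # Multiply the (mb x kb) block of A by the (kb x n) panel of B.
            for i in range(mb):
                for j in range(n):
                    acc = 0.0
                    for p in range(kb):
                        acc += blk[(r0 + i) * ld + (c0 + p)] * b[(p0 + p) * n + j]
                    c[(i0 + i) * n + j] += acc
    return c
```

For example, with a 3 x 3 matrix `a = [1, 2, 3, 4, 5, 6, 7, 8, 9]` and a 3 x 1 matrix `b = [1, 2, 3]`, both `gemm_blocked(a, b, 3, 1, 3, pack=True)` and `gemm_blocked(a, b, 3, 1, 3, pack=False)` return `[14.0, 32.0, 50.0]`. The sketch only shows where the extra copy happens; whether skipping it helps on real hardware depends on matrix shape and cache behavior, which is the question the abstract addresses.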