Optimizing Full-Spectrum Matrix Multiplications on ARMv8 Multi-Core CPUs

Weiling Yang, Jianbin Fang, Dezun Dong, Xing Su, Zheng Wang
DOI: https://doi.org/10.1109/tpds.2024.3350368
IF: 5.3
2024-02-02
IEEE Transactions on Parallel and Distributed Systems
Abstract: General Matrix Multiplication (GEMM) is a key subroutine in high-performance computing. While the mainstream Basic Linear Algebra Subprograms (BLAS) libraries can deliver good performance on large and regular-shaped GEMMs, they are inadequate for optimizing small and irregular-shaped GEMMs, which are commonly seen in emerging HPC applications. Recent research has focused on improving GEMM performance on GPUs, but there is still significant room for improvement on emerging HPC hardware based on multi-core CPUs. We present LibShalom2, an open-source library to optimize full-spectrum GEMMs, taking small, irregular-shaped, and large-scale regular-shaped matrices. LibShalom2 explicitly targets the ARMv8 architecture, which is becoming common in HPC systems. LibShalom2 is designed to minimize the expensive memory accessing overhead for data packing and processing small matrices. It uses analytic methods to determine GEMM kernel optimization parameters, enhancing the computation and parallelization efficiency of the GEMM kernels. We evaluate LibShalom2 by applying it to three ARMv8 multi-core architectures and comparing it against five mainstream linear algebra libraries. Experimental results show that LibShalom2 consistently outperforms existing solutions across full-spectrum GEMM workloads and hardware architectures. We also show that LibShalom2 delivers an average speedup of 2.2x for real-life neural network workloads.
computer science, theory & methods; engineering, electrical & electronic