Abstract: The resurgence of machine learning has increased the demand for high-performance basic linear algebra subroutines (BLAS), which have long depended on libraries to achieve peak performance on commodity hardware. High-performance BLAS implementations rely on a layered approach that consists of tiling and packing layers, for data (re)organization, and micro kernels that perform the actual computations. Creating high-performance micro kernels requires significant development effort to write tailored assembly code for each architecture. This hand-optimization task is complicated by the recent introduction of matrix engines, such as IBM's POWER10 MMA, Intel AMX, and Arm ME, to deliver high-performance matrix operations. This paper presents a compiler-only alternative to the use of high-performance libraries by incorporating, to the best of our knowledge and for the first time, the automatic generation of the layered approach into LLVM, a production compiler. The modular design of the algorithm, such as the use of LLVM's matrix-multiply intrinsic for a clear interface between the tiling and packing layers and the micro kernel, makes it easy to retarget the code generation to multiple accelerators. The use of intrinsics enables a comprehensive performance study. On processors without hardware matrix engines, tiling and packing deliver performance up to 22x (Intel, small matrices) and more than 6x (POWER9, large matrices) faster than PLuTo, a widely used polyhedral optimizer. The performance also approaches that of high-performance libraries: it is only 34% slower than OpenBLAS and on par with Eigen for large matrices. With MMA in POWER10, this solution is, for large matrices, over 2.6x faster than the vector-extension solution, matches Eigen performance, and achieves up to 96% of BLAS peak performance.
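The layered approach described in the abstract (tiling, packing, and a micro kernel behind a narrow interface) can be illustrated with a minimal sketch. It is written in plain Python for readability and is not the paper's implementation, which generates the layers inside LLVM and lowers the micro kernel through LLVM's matrix-multiply intrinsic. The function names (`pack_tile`, `micro_kernel`, `tiled_gemm`), the tile sizes, and the loop order are all assumptions chosen for illustration.

```python
# Illustrative sketch only: names, tile sizes, and loop order are
# hypothetical and are not taken from the paper or from LLVM.

def pack_tile(M, row0, col0, rows, cols):
    """Packing layer: copy a rows x cols tile of M into a contiguous
    row-major buffer, zero-padding where the tile crosses the matrix edge."""
    n_rows, n_cols = len(M), len(M[0])
    buf = [0.0] * (rows * cols)
    for i in range(rows):
        for j in range(cols):
            r, c = row0 + i, col0 + j
            if r < n_rows and c < n_cols:
                buf[i * cols + j] = M[r][c]
    return buf

def micro_kernel(a_buf, b_buf, acc, tm, tk, tn):
    """Micro kernel: acc (tm x tn) += a_buf (tm x tk) @ b_buf (tk x tn).
    All operands are flat row-major buffers. In a compiler this is the
    piece that would be swapped per target (vector unit, matrix engine)."""
    for i in range(tm):
        for p in range(tk):
            a = a_buf[i * tk + p]
            for j in range(tn):
                acc[i * tn + j] += a * b_buf[p * tn + j]

def tiled_gemm(A, B, tm=2, tk=2, tn=2):
    """Tiling layer: walk C in tm x tn tiles and the shared dimension in
    tk-sized steps, packing each operand tile before calling the kernel.
    Assumes len(A[0]) == len(B)."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for j0 in range(0, n, tn):
        for p0 in range(0, k, tk):
            b_buf = pack_tile(B, p0, j0, tk, tn)   # packed once, reused for every row tile
            for i0 in range(0, m, tm):
                a_buf = pack_tile(A, i0, p0, tm, tk)
                acc = [0.0] * (tm * tn)
                micro_kernel(a_buf, b_buf, acc, tm, tk, tn)
                # Write back only the part of the tile that lies inside C.
                for i in range(min(tm, m - i0)):
                    for j in range(min(tn, n - j0)):
                        C[i0 + i][j0 + j] += acc[i * tn + j]
    return C
```

Calling `tiled_gemm([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]])` exercises the zero-padded edge tiles, since the shared dimension (3) is not a multiple of the tile size (2). The point of the structure is that only `micro_kernel` would need to change when retargeting; the tiling and packing loops stay the same.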

Generation of the Single Precision BLAS library for the Parallella platform, with Epiphany co-processor acceleration, using the BLIS framework

Fast Matrix Multiplication via Compiler-only Layered Data Reorganization and Intrinsic Lowering

BLASX: A High Performance Level-3 BLAS Library for Heterogeneous Multi-GPU Computing

Developing a BLAS library for the AMD AI Engine

Multi-Threaded Dense Linear Algebra Libraries for Low-Power Asymmetric Multicore Processors

Performance and Energy Optimization of Matrix Multiplication on Asymmetric big.LITTLE Processors

FT-BLAS: A Fault Tolerant High Performance BLAS Implementation on x86 CPUs

Lasa: Abstraction and Specialization for Productive and Performant Linear Algebra on FPGAs

Design and Implementation for Nonblocking Execution in GraphBLAS: Tradeoffs and Performance

Automatic generation of ARM NEON micro-kernels for matrix multiplication

Parallel Photonic Acceleration Processor for Matrix-Matrix Multiplication

Implementing Strassen's Algorithm with BLIS

High-Throughput MPSoC Implementation of Sparse Bayesian Learning Algorithm

Mapping Parallel Matrix Multiplication in GotoBLAS2 to the AMD Versal ACAP for Deep Learning

Toward matrix multiplication for deep learning inference on the Xilinx Versal

Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors

CLBlast: A Tuned OpenCL BLAS Library

Cascading GEMM: High Precision from Low Precision

Basker: A Threaded Sparse LU Factorization Utilizing Hierarchical Parallelism and Data Layouts

OpenSBLI: A framework for the automated derivation and parallel execution of finite difference solvers on a range of computer architectures