Mapping Parallel Matrix Multiplication in GotoBLAS2 to the AMD Versal ACAP for Deep Learning

Jie Lei, Enrique S. Quintana-Ortí
2024-04-23
Abstract: This paper investigates the design of parallel general matrix multiplication (GEMM) for a Versal Adaptive Compute Accelerated Platform (ACAP) equipped with a VC1902 system-on-chip and multiple Artificial Intelligence Engines (AIEs). Our efforts aim to port standard optimization techniques applied in the high-performance realization of GEMM on CPUs to the Versal ACAP. In particular, 1) we address the flexible exploitation of the Versal ACAP multi-level memory hierarchy; 2) we delve into the efficient use of the vector units in the AIE tiles, proposing an architecture-specific micro-kernel for mixed precision arithmetic to address the strong demand for adaptive-precision inference in deep learning; and 3) we introduce a parallel design for GEMM that spans multiple AIE tiles, enhancing the computational throughput. We conduct experimental profiling, with up to 32 AI Engines, that demonstrates the high parallel scalability of the solution.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
This paper addresses the problem of implementing parallel general matrix multiplication (GEMM) on the AMD Versal ACAP (Adaptive Compute Accelerated Platform), which is equipped with a VC1902 system-on-chip and multiple AI Engines (AIEs). The main goal is to carry over the standard optimization techniques used in high-performance GEMM on CPUs to the Versal ACAP, in order to improve computational efficiency for deep learning. The specific contributions are:
1. Exploiting the multi-level memory hierarchy of the Versal ACAP flexibly for data storage and processing.
2. Designing an architecture-specific micro-kernel for mixed-precision arithmetic to meet the demand for adaptive-precision inference in deep learning.
3. Proposing a parallel GEMM design that spans multiple AIE tiles to improve computational throughput, and evaluating it experimentally.
The paper first introduces the performance bottleneck of single-core architectures caused by the slowdown of Moore's Law and the end of Dennard scaling, along with the resulting shift toward multi-core processors and domain-specific accelerators. It then discusses in detail how to map the parallel GEMM algorithm from high-performance libraries such as GotoBLAS2 to the Versal ACAP, with particular attention to the SIMD units and memory hierarchy of the AIE tiles. It also presents a micro-kernel written specifically for the Versal ACAP to perform mixed-precision arithmetic, and explores how to parallelize GEMM across multiple AIE tiles for higher throughput.
Experiments with up to 32 AI Engines demonstrate the high parallel scalability of the proposed scheme. Finally, the paper analyzes the performance of multiple SIMD designs, identifies communication bottlenecks, and proposes optimization strategies.
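To make the GotoBLAS2-style structure mentioned above more concrete, the following is a minimal, sequential sketch of a blocked GEMM built around a micro-kernel. It is not the authors' implementation: the function names (`gemm_blocked`, `micro_kernel`), the block-size parameters (`mc`, `nc`, `kc`, `mr`, `nr`) and their default values are illustrative assumptions. The slices `Bc` and `Ac` stand in for the packed buffers that a real implementation would place at different levels of the memory hierarchy, and Python integers stand in for a wider accumulator type fed by narrower inputs.

```python
def micro_kernel(Ac, Bc, C, ic, jc, ir, jr, mr, nr, kc):
    """Update one mr x nr micro-tile of C as a sequence of kc rank-1 updates."""
    # Local accumulator: in a mixed-precision kernel this would be wider
    # than the input elements (assumption for illustration).
    acc = [[0] * nr for _ in range(mr)]
    for p in range(kc):
        for i in range(mr):
            a = Ac[ir + i][p]
            for j in range(nr):
                acc[i][j] += a * Bc[p][jr + j]
    # Write the accumulated micro-tile back into C.
    for i in range(mr):
        for j in range(nr):
            C[ic + ir + i][jc + jr + j] += acc[i][j]


def gemm_blocked(A, B, C, mc=4, nc=4, kc=4, mr=2, nr=2):
    """Compute C += A * B with five nested blocking loops (hypothetical sizes)."""
    m, k, n = len(A), len(B), len(B[0])
    for jc in range(0, n, nc):                  # loop over column panels of B and C
        nb = min(nc, n - jc)
        for pc in range(0, k, kc):              # loop over the shared dimension
            kb = min(kc, k - pc)
            # "Pack" a kb x nb block of B (stand-in for a buffer in a large memory).
            Bc = [row[jc:jc + nb] for row in B[pc:pc + kb]]
            for ic in range(0, m, mc):          # loop over row blocks of A and C
                mb = min(mc, m - ic)
                # "Pack" an mb x kb block of A (stand-in for a smaller, closer buffer).
                Ac = [row[pc:pc + kb] for row in A[ic:ic + mb]]
                for jr in range(0, nb, nr):     # loops over micro-tiles of C
                    for ir in range(0, mb, mr):
                        micro_kernel(Ac, Bc, C, ic, jc, ir, jr,
                                     min(mr, mb - ir), min(nr, nb - jr), kb)
    return C
```

The micro-tiles of C updated in the two innermost loops are independent of one another, which is the kind of work that can be distributed over several compute units. The summary above does not state which loop the authors parallelize across AIE tiles, so the sketch is left sequential.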