Mapping Parallel Matrix Multiplication in GotoBLAS2 to the AMD Versal ACAP for Deep Learning

Jie Lei, Enrique S. Quintana-Ortí

2024-04-23

Abstract: This paper investigates the design of parallel general matrix multiplication (GEMM) for a Versal Adaptive Compute Accelerated Platform (ACAP) equipped with a VC1902 system-on-chip and multiple Artificial Intelligence Engines (AIEs). Our efforts aim to port standard optimization techniques applied in the high-performance realization of GEMM on CPUs to the Versal ACAP. In particular, 1) we address the flexible exploitation of the Versal ACAP multi-level memory hierarchy; 2) we delve into the efficient use of the vector units in the AIE tiles, proposing an architecture-specific micro-kernel for mixed precision arithmetic to address the strong demand for adaptive-precision inference in deep learning; and 3) we introduce a parallel design for GEMM that spans multiple AIE tiles, enhancing the computational throughput. We conduct experimental profiling, with up to 32 AI Engines, that demonstrates the high parallel scalability of the solution.

Distributed, Parallel, and Cluster Computing

What problem does this paper attempt to address?

This paper addresses the implementation of parallel general matrix multiplication (GEMM) on the AMD Versal ACAP (Adaptive Compute Accelerated Platform), which is equipped with a VC1902 system-on-chip and multiple AI Engines (AIEs). The main goal is to port the standard optimization techniques used in high-performance GEMM on CPUs to the Versal ACAP, improving computational efficiency for deep learning. The specific contributions are:

1. Exploiting the multi-level memory hierarchy of the Versal ACAP flexibly for data storage and processing.
2. Designing an architecture-specific micro-kernel for mixed-precision arithmetic to meet the demand for adaptive-precision inference in deep learning.
3. Proposing a parallel GEMM design that spans multiple AIE tiles to improve computational throughput, together with an experimental performance analysis.

The paper first introduces the performance bottleneck of single-core architectures caused by the slowing of Moore's Law and Dennard scaling, as well as the resulting development of multi-core processors and domain-specific accelerators. It then discusses in detail how to map the parallel GEMM algorithm from high-performance libraries such as GotoBLAS2 to the Versal ACAP, in particular how to use the SIMD units and the memory hierarchy of the AIE tiles. The paper also presents a micro-kernel written specifically for the Versal ACAP that performs mixed-precision operations, and explores how to parallelize GEMM across multiple AIE tiles for higher throughput.

Experiments with up to 32 AI Engines demonstrate the high parallel scalability of the proposed scheme. Finally, the paper conducts a comprehensive performance analysis of multiple SIMD designs, identifies communication bottlenecks, and proposes optimization strategies.
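To make the GotoBLAS2-style mapping described above more concrete, the following is a minimal sketch in pure Python of a blocked GEMM (C += A·B) with five nested loops around a micro-kernel. It is not the authors' implementation. The function names (`micro_kernel`, `gemm_blocked`, `gemm_reference`), the block sizes `mc`, `nc`, `kc`, `mr`, `nr`, and the tile count `n_tiles` are all hypothetical. Distributing the `jr` loop round-robin over tiles is an assumption made only for illustration, since the summary does not say which loop the paper parallelizes. Python integers have arbitrary precision, so the sketch only emulates mixed precision: inputs are assumed to fit in 8 bits, and the accumulator block stands in for wider hardware accumulators.

```python
def micro_kernel(Ac, Bc, C, ic, jc, ir, jr, mr, nr, kb):
    """Update an mr x nr block of C as a sequence of kb rank-1 updates."""
    # Local accumulators: a stand-in for wide (e.g. 32-bit) vector accumulators.
    acc = [[0] * nr for _ in range(mr)]
    for p in range(kb):
        for i in range(mr):
            a = Ac[ir + i][p]              # narrow (assumed 8-bit) input
            for j in range(nr):
                acc[i][j] += a * Bc[p][jr + j]
    # Write the accumulated block back to C.
    for i in range(mr):
        for j in range(nr):
            C[ic + ir + i][jc + jr + j] += acc[i][j]


def gemm_blocked(A, B, C, mc=8, nc=8, kc=4, mr=4, nr=4, n_tiles=4):
    """Blocked C += A * B; returns micro-kernel calls per (simulated) tile."""
    m, k, n = len(A), len(B), len(B[0])
    work = [0] * n_tiles
    for jc in range(0, n, nc):                      # loop 1: column panels of B and C
        nb = min(nc, n - jc)
        for pc in range(0, k, kc):                  # loop 2: slices of the k dimension
            kb = min(kc, k - pc)
            # "Pack" a kb x nb block of B (assumed to live in a large, slow memory level).
            Bc = [row[jc:jc + nb] for row in B[pc:pc + kb]]
            for ic in range(0, m, mc):              # loop 3: row blocks of A and C
                mb = min(mc, m - ic)
                # "Pack" an mb x kb block of A (assumed to live in a closer memory level).
                Ac = [row[pc:pc + kb] for row in A[ic:ic + mb]]
                for jr in range(0, nb, nr):         # loop 4: micro-panels of Bc
                    # Assumption: iterations of this loop go round-robin to tiles.
                    tile = (jr // nr) % n_tiles
                    for ir in range(0, mb, mr):     # loop 5: micro-panels of Ac
                        micro_kernel(Ac, Bc, C, ic, jc, ir, jr,
                                     min(mr, mb - ir), min(nr, nb - jr), kb)
                        work[tile] += 1
    return work


def gemm_reference(A, B):
    """Plain triple-loop product, used only to check the blocked version."""
    m, k, n = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]
```

Calling `gemm_blocked(A, B, C)` with `C` initialized to zeros should leave `C` equal to `gemm_reference(A, B)`, including when the matrix dimensions are not multiples of the block sizes. The returned per-tile counts are a rough proxy for load balance in this simulation and say nothing about the communication costs that the paper analyzes.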

