Abstract: General matrix multiplication (GEMM) plays a central role in a broad range of domains such as deep learning, scientific computing, and image processing. The primary optimization method is to partition the matrix into many tiles and exploit the parallelism within and between tiles. The tiling hierarchy closely mirrors the thread hierarchy on GPUs. In practice, GPUs can fully unleash their computing power only when the matrix is large enough to provide a sufficient number of tiles and enough work per tile. However, in many real-world applications, especially in the deep learning domain, the matrices are small. To address this, prior work proposes batched GEMM, which processes a group of small independent GEMMs together in a single CUDA kernel. However, the current support for batched GEMM is still rudimentary. Tiling and batching are tightly coupled. A large tile size increases data reuse but decreases thread-level parallelism, which in turn shrinks the optimization space for batching. A small tile size increases thread-level parallelism and thus enlarges the optimization space for batching, but at the cost of data reuse. In this paper, we propose a coordinated tiling and batching framework for accelerating GEMMs on GPUs. It is a two-phase framework consisting of a tiling engine and a batching engine that together perform efficient batched GEMM on GPUs. The tiling engine partitions the GEMMs into independent tiles, and the batching engine assigns the tiles to thread blocks. Moreover, we propose a general programming interface for the coordinated tiling and batching solution. Finally, experimental results on synthetic batched GEMM cases show that our framework achieves about 1.40X speedup on average over the state-of-the-art technique. We also use GoogleNet as a real-world case study, on which our framework achieves 1.23X speedup.
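The tension between tile size and parallelism described in the abstract can be illustrated with a small host-side sketch. This is not the paper's tiling or batching engine. The function names (`tile_count`, `reuse_per_tile`, `batch_tiles`), the reuse metric, and the simple consecutive grouping of tiles into thread blocks are all assumptions made purely for illustration.

```python
from math import ceil


def tile_count(m, n, tile_m, tile_n):
    """Number of output tiles needed to cover an m x n result matrix."""
    return ceil(m / tile_m) * ceil(n / tile_n)


def reuse_per_tile(tile_m, tile_n):
    """Illustrative reuse metric (an assumption, not the paper's model):
    per step along k, a tile loads tile_m + tile_n elements and performs
    tile_m * tile_n multiply-adds."""
    return (tile_m * tile_n) / (tile_m + tile_n)


def batch_tiles(tile_counts, tiles_per_block):
    """Hypothetical batching policy: enumerate (gemm_id, tile_id) pairs
    for the whole batch and group consecutive tiles into thread blocks."""
    tiles = [(g, t) for g, count in enumerate(tile_counts) for t in range(count)]
    return [tiles[i:i + tiles_per_block]
            for i in range(0, len(tiles), tiles_per_block)]


if __name__ == "__main__":
    # Output shapes (M, N) of three small independent GEMMs in one batch.
    shapes = [(64, 64), (32, 32), (48, 16)]
    for tile in (32, 16):
        counts = [tile_count(m, n, tile, tile) for m, n in shapes]
        blocks = batch_tiles(counts, tiles_per_block=2)
        print(f"tile {tile}x{tile}: tiles per GEMM = {counts}, "
              f"reuse = {reuse_per_tile(tile, tile)}, "
              f"thread blocks = {len(blocks)}")
```

Under these assumptions, halving the tile edge multiplies the number of independent tiles available for batching while halving the reuse metric, which is the trade-off a coordinated scheme has to balance.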

Auto-tuning Dense Matrix Multiplication for GPGPU with Cache

Improving Performance of Matrix Multiplication and FFT on GPU

Understanding GEMM Performance and Energy on NVIDIA Ada Lovelace: A Machine Learning-Based Analytical Approach

Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning

Orchestrating Cache Management and Memory Scheduling for GPGPU Applications

Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs

Performance Modeling and Optimization of Sparse Matrix-Vector Multiplication on NVIDIA CUDA Platform

Optimizing sparse general matrix–matrix multiplication for DCUs

A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices

An optimized large-scale hybrid DGEMM design for CPUs and ATI GPUs

Generalized GPU Acceleration for Applications Employing Finite-Volume Methods

FT-GEMM: A Fault Tolerant High Performance GEMM Implementation on x86 CPUs

Optimizing Finite Volume Method Solvers on Nvidia GPUs

A coordinated tiling and batching framework for efficient GEMM on GPUs.

Register-Aware Optimizations for Parallel Sparse Matrix–Matrix Multiplication

Large-scale FFT on GPU clusters

Improving Dense Linear Equation Solver on Hybrid CPU-GPU System

Using GPUs to compute large out-of-card FFTs

Performance Tuning for GPU-Embedded Systems: Machine-Learning-based and Analytical Model-driven Tuning Methodologies

Register-based Implementation of the Sparse General Matrix-Matrix Multiplication on GPUs

Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures