Abstract: General matrix multiplication (GEMM) plays a central role in a broad range of domains such as deep learning, scientific computing, and image processing. The primary optimization method is to partition the matrix into many tiles and exploit the parallelism within and between tiles. The tiling hierarchy closely mirrors the thread hierarchy on GPUs. In practice, GPUs can fully unleash their computing power only when the matrix is large, so that there are enough tiles and enough work per tile. However, in many real-world applications, especially in deep learning, the matrices are small. To this end, prior work proposes batched GEMM, which processes a group of small independent GEMMs together with a single CUDA kernel. However, the current support for batched GEMM is still rudimentary. Tiling and batching are tightly coupled. A large tile size increases data reuse but decreases thread-level parallelism, which in turn shrinks the optimization space for batching. A small tile size increases thread-level parallelism and thus enlarges the optimization space for batching, but at the cost of data reuse. In this paper, we propose a coordinated tiling and batching framework for accelerating GEMMs on GPUs. It is a two-phase framework consisting of a tiling engine and a batching engine that together perform efficient batched GEMM on GPUs. The tiling engine partitions the GEMMs into independent tiles, and the batching engine assigns the tiles to thread blocks. Moreover, we propose a general programming interface for the coordinated tiling and batching solution. Finally, experimental results on synthetic batched GEMM cases show that our framework achieves a speedup of about 1.40X on average over the state-of-the-art technique. We also use GoogleNet as a real-world case study, where our framework achieves a 1.23X speedup.
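
The two-phase idea in the abstract (tile first, then assign tiles to thread blocks) can be illustrated with a small host-side sketch. This is not the paper's algorithm or interface: the names `Tile`, `tile_gemm`, `batch_tiles`, and `plan` are hypothetical, the work estimate is a plain multiply-accumulate count, and the greedy "largest tile to least-loaded block" rule is a stand-in heuristic chosen only to make the trade-off visible.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class Tile:
    gemm_id: int  # index of the GEMM in the batch that owns this tile
    row: int      # tile row index within the output matrix C
    col: int      # tile column index within the output matrix C
    work: int     # multiply-accumulate count needed to compute this tile


def tile_gemm(gemm_id, m, n, k, tile_m, tile_n):
    """Tiling phase (sketch): split the m x n output of one GEMM into tiles."""
    tiles = []
    for r in range(ceil(m / tile_m)):
        for c in range(ceil(n / tile_n)):
            # Edge tiles may be smaller than the nominal tile size.
            rows = min(tile_m, m - r * tile_m)
            cols = min(tile_n, n - c * tile_n)
            tiles.append(Tile(gemm_id, r, c, rows * cols * k))
    return tiles


def batch_tiles(tiles, num_blocks):
    """Batching phase (sketch): give each tile to the least-loaded block.

    Assumption: a greedy largest-first heuristic, used here for illustration.
    """
    blocks = [[] for _ in range(num_blocks)]
    loads = [0] * num_blocks
    for tile in sorted(tiles, key=lambda t: t.work, reverse=True):
        target = loads.index(min(loads))
        blocks[target].append(tile)
        loads[target] += tile.work
    return blocks, loads


def plan(gemms, tile_m, tile_n, num_blocks):
    """Run both phases for a batch given as a list of (m, n, k) shapes."""
    tiles = []
    for gemm_id, (m, n, k) in enumerate(gemms):
        tiles.extend(tile_gemm(gemm_id, m, n, k, tile_m, tile_n))
    return batch_tiles(tiles, num_blocks)


if __name__ == "__main__":
    batch = [(64, 64, 32), (32, 48, 16), (16, 16, 64)]
    for tile_size in (32, 16):
        blocks, loads = plan(batch, tile_size, tile_size, num_blocks=4)
        print(tile_size, sum(len(b) for b in blocks), loads)
```

Running the sketch with a smaller tile size produces more tiles for the same total work, which gives the batching phase more freedom to even out the per-block load. The cost that this sketch does not model is the reduced data reuse inside each tile.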

MACO: Exploring GEMM Acceleration on a Loosely-Coupled Multi-core Processor

MAC-DO: An Efficient Output-Stationary GEMM Accelerator for CNNs Using DRAM Technology

OpenGeMM: A High-Utilization GeMM Accelerator Generator with Lightweight RISC-V Control and Tight Memory Coupling

A 98 GMACs/W 32-Core Vector Processor in 65 nm CMOS

A coordinated tiling and batching framework for efficient GEMM on GPUs.

High Performance Matrix Multiplication on Many Cores

CINOC: Computing in Network-On-Chip with Tiled Many-Core Architectures for Large-Scale General Matrix Multiplications

Multi-Objective Hardware-Mapping Co-Optimisation for Multi-DNN Workloads on Chiplet-based Accelerators

Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors

Towards Highly Efficient DGEMM on the Emerging SW26010 Many-Core Processor

Parallel Efficient Sparse Matrix-Matrix Multiplication on Multicore Platforms

A Conv‐GEMM reconfigurable accelerator with WS‐RS dataflow for high throughput processing

Accelerating HPCG on Tianhe-2: A hybrid CPU-MIC algorithm

MOCHA: Multinode Cost Optimization in Heterogeneous Clouds with Accelerators

Efficient SNN multi-cores MAC array acceleration on SpiNNaker 2

NGEMM: Optimizing GEMM for Deep Learning via Compiler-based Techniques

T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge

Preliminary Investigation Of Accelerating Molecular Dynamics Simulation On Godson-T Many-Core Processor

A CPU/MIC Collaborated Parallel Framework for GROMACS on Tianhe-2 Supercomputer

Parallel GEMM-based convolution for deep learning on multicore RISC-V processors

GRAPHIC: Gather and Process Harmoniously in the Cache with High Parallelism and Flexibility