Abstract: Multi-head self-attention (MHSA) mechanisms achieve state-of-the-art (SOTA) performance across natural language processing and vision tasks. However, their quadratic dependence on sequence length bottlenecks inference speed. To circumvent this bottleneck, researchers have proposed various sparse-MHSA models, in which only a subset of the full attention is computed. Despite their promise, current sparse libraries and compilers do not support high-performance implementations for diverse sparse-MHSA patterns because of the underlying sparse formats they operate on. These formats, which are typically designed for high-performance and scientific computing applications, are curated either for extreme amounts of random sparsity (<1% non-zero values) or for specific sparsity patterns. However, the sparsity patterns in sparse-MHSA are moderately sparse (10-50% non-zero values) and varied, so existing sparse formats trade off generality for performance. We bridge this gap, achieving both generality and performance, by proposing a novel sparse format, affine-compressed-sparse-row (ACSR), and a supporting code-generation scheme, SPLAT, that generates high-performance implementations for diverse sparse-MHSA patterns on GPUs. Core to our proposed format and code-generation algorithm is the observation that common sparse-MHSA patterns have uniquely regular geometric properties. These properties, which can be analyzed just-in-time, expose novel optimizations and tiling strategies that SPLAT exploits to generate high-performance implementations for diverse patterns. To demonstrate SPLAT's efficacy, we use it to generate code for various sparse-MHSA models, achieving geomean speedups of 2.05x and 4.05x over hand-written kernels in Triton and TVM, respectively, on A100 GPUs. Moreover, its interfaces are intuitive and easy to use with existing implementations of MHSA in JAX.
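The observation about regular geometry and moderate sparsity can be illustrated with a small sketch. It is not the ACSR format or any part of SPLAT, whose details the abstract does not give. It assumes a sliding-window attention mask as one example of a sparse-MHSA pattern, and the helper names (`sliding_window_mask`, `density`, `row_as_affine_run`) are hypothetical. The sketch measures the fraction of non-zero entries in the mask and checks whether the non-zero columns of each row can be written as `start + stride * k`.

```python
# Illustrative only: a sliding-window mask is an assumed example pattern,
# and these helpers are hypothetical, not part of SPLAT or ACSR.

def sliding_window_mask(seq_len, window):
    """Boolean mask where token i attends to tokens j with |i - j| <= window."""
    return [[abs(i - j) <= window for j in range(seq_len)]
            for i in range(seq_len)]

def density(mask):
    """Fraction of entries in the mask that are non-zero (True)."""
    total = sum(len(row) for row in mask)
    nonzero = sum(sum(row) for row in mask)
    return nonzero / total

def row_as_affine_run(row):
    """Return (start, stride, count) if the non-zero columns of the row are
    exactly start + stride * k for k in range(count); otherwise None."""
    cols = [j for j, keep in enumerate(row) if keep]
    if not cols:
        return (0, 1, 0)
    if len(cols) == 1:
        return (cols[0], 1, 1)
    stride = cols[1] - cols[0]
    if all(b - a == stride for a, b in zip(cols, cols[1:])):
        return (cols[0], stride, len(cols))
    return None

mask = sliding_window_mask(seq_len=16, window=2)
runs = [row_as_affine_run(row) for row in mask]

print(f"density: {density(mask):.3f}")  # 74 of 256 entries are non-zero
print("every row is an affine run:", all(r is not None for r in runs))
print("row 8 as (start, stride, count):", runs[8])
```

For this mask, about 29% of entries are non-zero, which falls in the moderate range the abstract describes, and every row reduces to three integers instead of an explicit list of column indices. Whether and how ACSR stores rows this way is not stated here.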

Mentha: Enabling Sparse-Packing Computation on Systolic Arrays

SPSA: Exploring Sparse-Packing Computation on Systolic Arrays from Scratch

SPMSD: an Partitioning-Strategy for Parallel General Sparse Matrix-Matrix Multiplication on GPU

SparGD: A Sparse GEMM Accelerator with Dynamic Dataflow

SpArch: Efficient Architecture for Sparse Matrix Multiplication

Spada: Accelerating Sparse Matrix Multiplication with Adaptive Dataflow

ApSpGEMM: Accelerating Large-scale SpGEMM with Heterogeneous Collaboration and Adaptive Panel

Performance-Aware Model for Sparse Matrix-Matrix Multiplication on the Sunway TaihuLight Supercomputer

Accelerating Unstructured SpGEMM using Structured In-situ Computing

Parallel Efficient Sparse Matrix-Matrix Multiplication on Multicore Platforms

A sparse matrix vector multiplication accelerator based on high-bandwidth memory

High Performance Unstructured SpMM Computation Using Tensor Cores

Balanced Hashing and Efficient GPU Sparse General Matrix-Matrix Multiplication

GAS: General-Purpose In-Memory-Computing Accelerator for Sparse Matrix Multiplication

BafSP: Co-Design of Compute SRAM and Bit-Aware Data Flip Mitigation with In-Memory Sparsity Detection for SpMM

Register-Aware Optimizations for Parallel Sparse Matrix–Matrix Multiplication

SPLAT: A framework for optimised GPU code-generation for SParse reguLar ATtention

Efficient Sparse Matrix-Vector Multiplication Using Cache Oblivious Extension Quadtree Storage Format

SpMMPlu: A Compiler Plug-in with Sparse IR for Efficient Sparse Matrix Multiplication

Distributed-Memory Parallel Algorithms for Sparse Matrix and Sparse Tall-and-Skinny Matrix Multiplication