Abstract: In recent years, attention-based models have achieved impressive performance in natural language processing and computer vision applications by effectively capturing contextual knowledge from the entire sequence. However, the attention mechanism inherently contains a large number of redundant connections, imposing a heavy computational burden on model deployment. To this end, sparse attention has emerged as an attractive approach to reduce the computation and memory footprint. It involves both sampled dense-dense matrix multiplication (SDDMM) and sparse-dense matrix multiplication (SpMM), and therefore requires the hardware to eliminate zero-valued operations effectively. Existing techniques rely on either irregular sparse patterns, which lead to low hardware efficiency, or regular but coarse-grained patterns, which yield smaller computation savings. This paper proposes Sanger, a framework that harvests sparsity in the attention mechanism through synergistic hardware and software co-design. The software part prunes the attention matrix into a dynamic structured pattern, and the hardware part features a reconfigurable architecture that exploits such patterns. Specifically, we dynamically sparsify vanilla attention based on a quantized prediction of the attention matrix. Then, the sparse mask is re-arranged into structured blocks that are more amenable to hardware implementation. The hardware design of Sanger features a score-stationary dataflow that keeps sparse scores stationary in the processing elements (PEs) to avoid decoding overhead. Using this dataflow and a reconfigurable systolic array design, we can unify the computation of SDDMM and SpMM operations. In particular, the PEs can be configured at runtime to support different data access and partial sum accumulation schemes.
Experiments on BERT show that Sanger can prune the model to a sparsity of 0.08 to 0.27 without accuracy loss, achieving 4.64×, 22.7×, 2.39×, and 1.47× speedups over a V100 GPU, an AMD Ryzen Threadripper 3970X CPU, and the state-of-the-art attention accelerators A3 and SpAtten, respectively.
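The software side of the abstract (predict the attention matrix at low precision, threshold it into a mask, then compute only the surviving scores) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The 4-bit symmetric quantizer, the threshold value, the rule that each row keeps its largest entry, and all function names are assumptions made for illustration. The re-arrangement of the mask into structured blocks and the hardware dataflow are not modeled.

```python
import numpy as np

def quantize(x, bits=4):
    """Symmetric uniform quantization (assumed scheme, not taken from the paper)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    if scale == 0:
        return np.zeros_like(x), 1.0
    return np.clip(np.round(x / scale), -qmax, qmax), scale

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def predict_mask(Q, K, bits=4, threshold=0.02):
    """Predict attention at low precision and threshold it into a binary mask."""
    d = Q.shape[-1]
    Qq, sq = quantize(Q, bits)
    Kq, sk = quantize(K, bits)
    approx = softmax((Qq @ Kq.T) * (sq * sk) / np.sqrt(d))
    mask = approx >= threshold
    # Assumption: always keep each row's largest entry so no row is empty.
    mask[np.arange(mask.shape[0]), approx.argmax(axis=-1)] = True
    return mask

def sparse_attention(Q, K, V, mask):
    """Attention restricted to the positions where mask is True."""
    d = Q.shape[-1]
    rows, cols = np.nonzero(mask)
    # SDDMM: dot products are computed only at the sampled (row, col) positions.
    scores = np.einsum("nd,nd->n", Q[rows], K[cols]) / np.sqrt(d)
    S = np.full(mask.shape, -np.inf)
    S[rows, cols] = scores
    P = softmax(S)  # pruned positions receive exactly zero probability
    # SpMM: accumulate only the non-zero probabilities times rows of V.
    out = np.zeros((Q.shape[0], V.shape[1]))
    np.add.at(out, rows, P[rows, cols][:, None] * V[cols])
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))
mask = predict_mask(Q, K)
out = sparse_attention(Q, K, V, mask)
print("kept fraction:", mask.mean(), "output shape:", out.shape)
```

With an all-True mask the function reduces to ordinary dense attention, which gives a convenient correctness check. The fraction of entries kept depends on the threshold and the input data.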

SAC: Accelerating and Structuring Self-Attention Via Sparse Adaptive Connection.

Fine-tune BERT with Sparse Self-Attention Mechanism.

Spatially-Aware Context Neural Networks.

SCCA: Shifted Cross Chunk Attention for long contextual semantic expansion

Reinforced Self-Attention Network: a Hybrid of Hard and Soft Attention for Sequence Modeling

Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers

Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition

Multiscale Self Attentive Convolutions for Vision and Language Modeling

SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention

SCSA: Exploring the Synergistic Effects Between Spatial and Channel Attention

SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning

A Regularized Framework for Sparse and Structured Neural Attention

Sanger: A Co-Design Framework for Enabling Sparse Attention Using Reconfigurable Architecture.

Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences

ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition

A Spatial–Channel–Temporal-Fused Attention for Spiking Neural Networks

Post-Training Sparse Attention with Double Sparsity

CACnet: Cube Attentional CNN for Automatic Speech Recognition

SACNet: A Scattered Attention-based Network with Feature Compensator for Visual Localization

Tensorized Self-Attention: Efficiently Modeling Pairwise and Global Dependencies Together

FsaNet: Frequency Self-attention for Semantic Segmentation