Abstract: In recent years, attention-based models have achieved impressive performance in natural language processing and computer vision applications by effectively capturing contextual knowledge from the entire sequence. However, the attention mechanism inherently contains a large number of redundant connections, imposing a heavy computational burden on model deployment. To this end, sparse attention has emerged as an attractive approach to reduce the computation and memory footprint. It involves both sampled dense-dense matrix multiplication (SDDMM) and sparse-dense matrix multiplication (SpMM), and thus requires the hardware to eliminate zero-valued operations effectively. Existing techniques based on irregular sparse patterns lead to low hardware efficiency, while those based on regular but coarse-grained patterns save less computation. This paper proposes Sanger, a framework that harvests sparsity in the attention mechanism through synergistic hardware and software co-design. The software part prunes the attention matrix into a dynamic structured pattern, and the hardware part features a reconfigurable architecture that exploits such patterns. Specifically, we dynamically sparsify vanilla attention based on a quantized prediction of the attention matrix. Then, the sparse mask is re-arranged into structured blocks that are more amenable to hardware implementation. The hardware design of Sanger features a score-stationary dataflow that keeps sparse scores stationary in the processing element (PE) to avoid decoding overhead. Using this dataflow and a reconfigurable systolic array design, we can unify the computation of SDDMM and SpMM operations. In particular, the PEs can be configured at runtime to support different data access and partial sum accumulation schemes.
Experiments on BERT show that Sanger can prune the model to 0.08 - 0.27 sparsity without accuracy loss, achieving 4.64X, 22.7X, 2.39X, and 1.47X speedups over a V100 GPU, an AMD Ryzen Threadripper 3970X CPU, and the state-of-the-art attention accelerators A3 and SpAtten, respectively.
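
To make the software side of the abstract concrete, the following is a minimal sketch of prediction-based sparse attention, not the implementation used in Sanger. It assumes symmetric 4-bit quantization and a fixed threshold on the predicted softmax scores; the function names (`quantize`, `predict_mask`, `sparse_attention`) and parameter values are hypothetical. The re-arrangement of the mask into structured blocks and all hardware aspects are omitted.

```python
import math


def quantize(matrix, bits=4):
    """Symmetric uniform quantization (assumed scheme); returns dequantized floats."""
    max_abs = max(abs(x) for row in matrix for x in row) or 1.0
    scale = max_abs / (2 ** (bits - 1) - 1)
    return [[round(x / scale) * scale for x in row] for row in matrix]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def predict_mask(Q, K, threshold, bits=4):
    """Estimate softmax(QK^T / sqrt(d)) at low precision and threshold it to a 0/1 mask."""
    d = len(Q[0])
    Qq, Kq = quantize(Q, bits), quantize(K, bits)
    mask = []
    for q in Qq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in Kq]
        mask.append([1 if p >= threshold else 0 for p in softmax(scores)])
    return mask


def sparse_attention(Q, K, V, mask):
    """Full-precision attention restricted to the positions kept by the mask."""
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        cols = [j for j, keep in enumerate(mask[i]) if keep]
        if not cols:  # nothing kept in this row: emit zeros
            out.append([0.0] * len(V[0]))
            continue
        # SDDMM step: compute scores only at the sampled (kept) positions.
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d) for j in cols]
        probs = softmax(scores)
        # SpMM step: multiply the sparse probabilities with the dense V.
        out.append([sum(p * V[j][c] for p, j in zip(probs, cols))
                    for c in range(len(V[0]))])
    return out


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[4.0, 0.0], [0.0, 4.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
mask = predict_mask(Q, K, threshold=0.2)
output = sparse_attention(Q, K, V, mask)
```

In this toy input each query matches one key strongly, so the predicted mask keeps a single position per row and each output row is the corresponding row of `V`. Setting `threshold=0.0` keeps every position and reduces the computation to dense attention.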

Fine-tune BERT with Sparse Self-Attention Mechanism.

Improving BERT with Self-Supervised Attention

SAC: Accelerating and Structuring Self-Attention Via Sparse Adaptive Connection.

SesameBERT: Attention for Anywhere

SparseBERT: Rethinking the Importance Analysis in Self-attention

Improving BERT Fine-Tuning via Self-Ensemble and Self-Distillation

Sparse is Enough in Fine-tuning Pre-trained Large Language Models

A Sparse Self-Attention Enhanced Model for Aspect-Level Sentiment Classification

Enhanced Aspect-Based Sentiment Analysis Models with Progressive Self-supervised Attention Learning

Context-Guided BERT for Targeted Aspect-Based Sentiment Analysis

Sensi-BERT: Towards Sensitivity Driven Fine-Tuning for Parameter-Efficient BERT

SPT: Fine-Tuning Transformer-based Language Models Efficiently with Sparsification

Composable Sparse Fine-Tuning for Cross-Lingual Transfer

BERTer: The Efficient One

Post-Training Sparse Attention with Double Sparsity

Reinforced Self-Attention Network: a Hybrid of Hard and Soft Attention for Sequence Modeling

Self attention mechanism of bidirectional information enhancement

Self-Distillation Bridges Distribution Gap in Language Model Fine-Tuning

Improved Visual Fine-tuning with Natural Language Supervision

Sanger: A Co-Design Framework for Enabling Sparse Attention Using Reconfigurable Architecture.

Alleviating Over-smoothing for Unsupervised Sentence Representation.