Abstract: The attention mechanism is becoming a vital building block across various modern neural networks, e.g., Transformers. However, it runs inefficiently on general-purpose GPU/CPU platforms, which motivates dedicated accelerator designs. Existing accelerators are commonly devised by exploiting the potential sparsity in the attention mechanism through a hardware-software co-design scheme, which suffers from complicated training and fine-tuning processes and possible accuracy degradation. More importantly, the sparse patterns are tailored to certain datasets and therefore generalize poorly, and fine-grained sparse patterns can also introduce hardware inefficiency. Instead, we approach these issues from another perspective: by systematically analyzing the inherent dataflow characteristics of the attention mechanism, we propose the Co-Operative Systolic Arrays (COSA) with an optimized dataflow to support general-purpose attention mechanisms and pursue higher computational efficiency. The COSA system exploits the high parallelism inherent in the model and leverages run-time configurable hybrid dataflows, i.e., weight stationary and output stationary, so that a systolic array can support the varying matrix multiplications in the attention mechanism. For the cascaded matrix multiplications, COSA proposes several levels of fusion methodologies to reduce off-chip access and enhance PE utilization, such as deep fusion, which directly uses the result of the output-stationary systolic array as the weight of the weight-stationary systolic array. Additionally, the COSA system provides a solution to hide the latency and radically reduce the buffer size related to softmax. Experimental results show that, across various benchmarks, COSA achieves 2.29-2.60× throughput improvement over a traditional systolic array with the same number of MACs, with up to 94.7% PE utilization and 8.2× less off-chip memory access. Compared with general-purpose platforms, COSA achieves 7.6-12.4× energy efficiency over an NVIDIA GeForce 3090 GPU and 35.2-80.9× energy efficiency over an Intel 6226R server CPU.
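
The hybrid dataflow and deep fusion idea in the abstract can be illustrated with a small functional sketch in plain Python. It is not the COSA hardware or its scheduling: the function names, loop orders, and the transposed formulation of the second product are assumptions chosen for illustration, and the model is not cycle-accurate. The first product, Q K^T, is computed in an output-stationary style (each accumulator stays in place), and after softmax the same matrix is reused as the stationary "weight" of a weight-stationary product with V, so the intermediate scores are never written out and reloaded.

```python
import math

def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul_output_stationary(a, b):
    """C = A @ B. Each accumulator c[i][j] stays fixed (like one PE)
    while operands stream past it, one k-step at a time."""
    n, k_dim, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for k in range(k_dim):              # one streamed operand wavefront per step
        for i in range(n):
            for j in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c

def matmul_weight_stationary(x, w):
    """Y = X @ W. W is held fixed; rows of X stream through and
    partial sums flow out of the array."""
    rows, m = len(w), len(w[0])
    y = []
    for x_row in x:                     # stream one input row at a time
        psum = [0.0] * m
        for r in range(rows):
            for j in range(m):
                psum[j] += x_row[r] * w[r][j]
        y.append(psum)
    return y

def softmax_rows(s):
    out = []
    for row in s:
        mx = max(row)                   # subtract the max for numerical stability
        e = [math.exp(v - mx) for v in row]
        total = sum(e)
        out.append([v / total for v in e])
    return out

def fused_attention(q, k, v):
    """Illustrative fusion: the output-stationary result becomes the
    stationary operand of the next multiplication (assumed mapping)."""
    scale = 1.0 / math.sqrt(len(q[0]))
    # Stage 1 (output stationary): S = Q K^T accumulates in place.
    s = matmul_output_stationary(q, transpose(k))
    p = softmax_rows([[x * scale for x in row] for row in s])
    # Stage 2 (weight stationary): P plays the weight role and V streams.
    # O^T = V^T P^T is mathematically the same as O = P V.
    o_t = matmul_weight_stationary(transpose(v), transpose(p))
    return transpose(o_t)
```

As a usage example, calling fused_attention with identical key rows gives uniform attention weights, so every output row is the column-wise mean of V. In this sketch the transposes are explicit list operations; in hardware the equivalent effect would come from how operands are routed into the array, which the abstract does not specify.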

CoDA: A Co-Design Framework for Versatile and Efficient Attention Accelerators

A Convolutional Neural Network Accelerator Architecture with Fine-Granular Mixed Precision Configurability

COSA Plus: Enhanced Co-Operative Systolic Arrays for Attention Mechanism in Transformers

COSA: Co-Operative Systolic Arrays for Multi-head Attention Mechanism in Neural Network Using Hybrid Data Reuse and Fusion Methodologies

Exploring Approximation and Dataflow Co-Optimization for Scalable Transformer Inference Architecture on the Edge

A Length Adaptive Algorithm-Hardware Co-design of Transformer on FPGA Through Sparse Attention and Dynamic Pipelining

CODEBench: A Neural Architecture and Hardware Accelerator Co-Design Framework

Implementing and Optimizing the Scaled Dot-Product Attention on Streaming Dataflow

DTATrans: Leveraging Dynamic Token-Based Quantization with Accuracy Compensation Mechanism for Efficient Transformer Architecture

Invited: Algorithm-Software-Hardware Co-Design for Deep Learning Acceleration

Co-Designing Binarized Transformer and Hardware Accelerator for Efficient End-to-End Edge Deployment

Ayaka: A Versatile Transformer Accelerator with Low-Rank Estimation and Heterogeneous Dataflow

Accelerating Attention Mechanism on FPGAs Based on Efficient Reconfigurable Systolic Array

MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices

Transformer Acceleration with Dynamic Sparse Attention

An Analog and Digital Hybrid Attention Accelerator for Transformers with Charge-based In-memory Computing

F-CAD: A Framework to Explore Hardware Accelerators for Codec Avatar Decoding

High-Performance Method and Architecture for Attention Computation in DNN Inference

DNA: Differentiable Network-Accelerator Co-Search

Memory-Computing Decoupling: A DNN Multitasking Accelerator with Adaptive Data Arrangement

Design of a Convolutional Neural Network Accelerator Based on On-Chip Data Reordering