Abstract: The transformer model has demonstrated outstanding performance in the field of artificial intelligence. However, this performance comes at the cost of substantial computational complexity, which limits the deployment of transformers from cloud to edge because of power and throughput constraints. There are two main challenges in designing a transformer accelerator for practical tasks. First, a transformer's bottleneck shifts with input length: for short inputs, such as a vision transformer (ViT) on ImageNet or bidirectional encoder representations from transformers (BERT) on the general language understanding evaluation (GLUE) benchmark, the linear layers of the model are the computational bottleneck. In contrast, for long inputs, such as high-resolution images or long-text tasks, attention computation becomes the bottleneck. Second, even for a given input length, different layers in the model exhibit different computational characteristics and workloads, such as matrix sizes and data reuse strategies. This article introduces Ayaka, a versatile transformer accelerator designed to address these issues. Ayaka uses a cross-layer sparse prediction approach based on random projection (RP), which sparsifies attention computation and linear layers simultaneously and thereby improves throughput regardless of which bottleneck dominates at a given input length. Furthermore, Ayaka optimizes the sparse attention computation by leveraging the input translation invariance of softmax. In addition, Ayaka features a heterogeneous dataflow processing element (HDPE) design that dynamically adjusts the stationary matrix operands based on the current computation to maximize on-chip data reuse and reduce memory footprint. With these features, Ayaka is, so far, the first accelerator to accelerate the whole attention layer. Evaluation on 12 typical models and tasks shows that it achieves a peak energy efficiency of 49.7 TOPS/W, which is 1.20–258.9$\times$ higher than state-of-the-art works.
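
The two algorithmic ideas named in the abstract, predicting attention sparsity with a random projection and exploiting the translation invariance of softmax, can be illustrated in software. The following is a minimal sketch under stated assumptions, not Ayaka's hardware algorithm: the function names, the ±1 projection matrix, and the top-k selection rule are all hypothetical choices made for illustration. Scores are estimated cheaply in a low-dimensional projected space, only the highest-scoring keys are kept, and an exact softmax is computed over that subset. Because softmax(x) = softmax(x − c) for any constant c, the maximum can be taken over the selected subset alone.

```python
import math
import random


def random_projection(dim, proj_dim, rng):
    """Build a dim x proj_dim matrix with entries +/- 1/sqrt(proj_dim) (assumed form)."""
    scale = 1.0 / math.sqrt(proj_dim)
    return [[rng.choice((-scale, scale)) for _ in range(proj_dim)] for _ in range(dim)]


def project(vec, proj):
    """Map a length-dim vector into the proj_dim-dimensional space."""
    return [sum(v * row[j] for v, row in zip(vec, proj)) for j in range(len(proj[0]))]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Softmax with the maximum subtracted first; the shift leaves the result unchanged."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_top_k(query, keys, proj, k):
    """Estimate q.k in the projected space and return the indices of the k largest estimates."""
    q_low = project(query, proj)
    est = [dot(q_low, project(key, proj)) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: est[i], reverse=True)
    return sorted(ranked[:k])


def sparse_attention_weights(query, keys, selected):
    """Exact softmax over the selected keys only; unselected keys get weight zero."""
    probs = softmax([dot(query, keys[i]) for i in selected])
    weights = [0.0] * len(keys)
    for i, p in zip(selected, probs):
        weights[i] = p
    return weights
```

In this sketch, `predict_top_k` would be called once per query row and `sparse_attention_weights` would then replace the dense softmax. The projected dimension `proj_dim` and the budget `k` trade prediction accuracy against the cost of the estimate; suitable values are workload-dependent and are not taken from the paper.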
