Abstract: Transformer-based models achieve tremendous success in many artificial intelligence (AI) tasks, outperforming conventional convolutional neural networks (CNNs) in fields ranging from natural language processing (NLP) to computer vision (CV). Their success relies on the self-attention mechanism, which provides a global receptive field rather than the local one of CNNs. Despite its superiority, global self-attention consumes more operations than CNNs and cannot be handled effectively by existing CNN processors because its operations are distinct. This creates an urgent need for a dedicated Transformer processor. However, global self-attention involves a massive number of naturally occurring weakly related tokens (WR-Tokens), caused by the redundant content in human languages and images. These WR-Tokens generate zero and near-zero attention results that introduce an energy consumption bottleneck, redundant computations, and hardware under-utilization, making it challenging to achieve energy-efficient self-attention computing. This article proposes a Transformer processor that handles the WR-Tokens effectively to solve these challenges. First, a big-exact-small-approximate processing element (PE) reduces multiply-and-accumulate (MAC) energy for WR-Tokens by adaptively computing the small values approximately while computing the large values exactly. Second, a bidirectional asymptotical speculation unit captures and removes redundant computations of zero attention outputs by exploiting the local property of self-attention. Third, an out-of-order PE-line computing scheduler improves hardware utilization for near-zero values by reordering the operands to dovetail two operations into one multiplication. Fabricated in a 28-nm CMOS technology, the proposed processor occupies an area of 6.82 mm². When evaluated with 90% approximate computing for the generative pre-trained transformer 2 (GPT-2) model, the peak energy efficiency is 27.56 TOPS/W under 0.56 V at 50 MHz, higher than that of the A100 graphics processing unit (GPU). Compared with the state-of-the-art Transformer processor, it reduces energy and offers a speedup.
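The premise behind WR-Tokens can be illustrated with a small software sketch. The code below is not the processor's method. It only shows, under made-up inputs, that a softmax over attention scores leaves most tokens with near-zero weights, and that dropping those weights changes the attention output very little. The scores, values, the `threshold` cutoff, and the helper names `softmax` and `attend` are all assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(weights, values, threshold=0.0):
    """Weighted sum of scalar values, skipping weights below `threshold`."""
    return sum(w * v for w, v in zip(weights, values) if w >= threshold)


n_tokens = 64

# Hypothetical attention scores for one query: most tokens score low,
# and three tokens (indices chosen arbitrarily) are strongly related.
scores = [0.1 * (i % 7) for i in range(n_tokens)]
for i in (3, 17, 42):
    scores[i] += 10.0

# Hypothetical scalar "value" per token, bounded in [-1, 1].
values = [math.sin(i) for i in range(n_tokens)]

weights = softmax(scores)

# Illustrative cutoff for "weakly related"; not a figure from the article.
threshold = 1e-3
weak_fraction = sum(w < threshold for w in weights) / n_tokens

exact = attend(weights, values)              # every token contributes
pruned = attend(weights, values, threshold)  # weak tokens are skipped
error = abs(exact - pruned)

print(f"weak fraction: {weak_fraction:.3f}, abs error: {error:.2e}")
```

In this toy setting the skipped tokens account for most of the multiply-accumulate work but almost none of the result, which is the kind of redundancy the abstract attributes to WR-Tokens. The article's hardware handles such tokens with approximate arithmetic, speculation, and operand reordering rather than a fixed threshold.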

Optimizing Attention by Exploiting Data Reuse on ARM Multi-core CPUs

MOC: Multi-Objective Mobile CPU-GPU Co-Optimization for Power-Efficient DNN Inference

MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices

Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation

Optimizing the Deployment of Tiny Transformers on Low-Power MCUs

Accelerator-driven Data Arrangement to Minimize Transformers Run-time on Multi-core Architectures

Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures

Optimizing Layer-Fused Scheduling of Transformer Networks on Multi-accelerator Platforms

Energon: Toward Efficient Acceleration of Transformers Using Dynamic Sparse Attention

An Analog and Digital Hybrid Attention Accelerator for Transformers with Charge-based In-memory Computing

An Energy-Efficient Transformer Processor Exploiting Dynamic Weak Relevances in Global Attention

Exploring Approximation and Dataflow Co-Optimization for Scalable Transformer Inference Architecture on the Edge

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers

Folding Attention: Memory and Power Optimization for On-Device Transformer-based Streaming Speech Recognition

AttentionLego: An Open-Source Building Block For Spatially-Scalable Large Language Model Accelerator With Processing-In-Memory Technology

RACE-IT: A Reconfigurable Analog CAM-Crossbar Engine for In-Memory Transformer Acceleration

Ayaka: A Versatile Transformer Accelerator with Low-Rank Estimation and Heterogeneous Dataflow

Benchmarking and In-depth Performance Study of Large Language Models on Habana Gaudi Processors

Quartet: A 22nm 0.09mJ/Inference Digital Compute-in-Memory Versatile AI Accelerator with Heterogeneous Tensor Engines and Off-Chip-Less Dataflow

AttMEMO: Accelerating Transformers with Memoization on Big Memory Systems