Abstract: Transformer networks have outperformed recurrent and convolutional neural networks in terms of accuracy in various sequential tasks. However, memory and compute bottlenecks prevent transformer networks from scaling to long sequences due to their high execution time and energy consumption. Different neural attention mechanisms have been proposed to lower the computational load, but they still suffer from the memory bandwidth bottleneck. In-memory processing can help alleviate memory bottlenecks by reducing the transfer overhead between the memory and compute units, thus allowing transformer networks to scale to longer sequences. We propose an in-memory transformer network accelerator (iMTransformer) that uses a combination of crossbars and content-addressable memories to accelerate transformer networks. We accelerate transformer networks by (1) computing in-memory, thus minimizing the memory transfer overhead, (2) caching reusable parameters to reduce the number of operations, and (3) exploiting the available parallelism in the attention mechanism computation. To reduce energy consumption, the following techniques are introduced: (1) a configurable attention selector is used to choose among different sparse attention patterns, (2) locality-sensitive hashing aided by a content-addressable memory filters sequence elements by their importance, and (3) FeFET-based crossbars are used to store projection weights while CMOS-based crossbars are used as an attentional cache to store attention scores for later reuse. The CMOS-FeFET hybrid iMTransformer yields a significant energy improvement over the CMOS-only iMTransformer. The CMOS-FeFET hybrid iMTransformer achieves an 8.96× delay improvement and a 12.57× energy improvement for the vanilla transformer compared to the GPU baseline at a sequence length of 512. 
Implementing BERT using the CMOS-FeFET hybrid iMTransformer achieves a 13.71× delay improvement and an 8.95× energy improvement compared to the GPU baseline at a sequence length of 512. The hybrid iMTransformer also achieves a throughput of 2.23 K samples/s and an energy efficiency of 124.8 samples/s/W on the MLPerf benchmark with BERT-large and the SQuAD 1.1 dataset, an 11× speedup and a 7.92× energy improvement compared to the GPU baseline.
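
The hashing-based filtering mentioned in the abstract can be illustrated in software. The sketch below is not the accelerator's implementation: it assumes sign-random-projection hashing and a Hamming-distance threshold as a stand-in for a content-addressable memory search, neither of which is specified in the abstract. The names `lsh_signature`, `filtered_attention`, and `max_distance` are hypothetical.

```python
import math


def lsh_signature(vec, planes):
    """Sign-random-projection hash: one bit per hyperplane (assumed scheme)."""
    return tuple(
        1 if sum(v * p for v, p in zip(vec, plane)) >= 0 else 0
        for plane in planes
    )


def hamming(a, b):
    """Number of differing bits between two signatures."""
    return sum(x != y for x, y in zip(a, b))


def filtered_attention(query, keys, values, planes, max_distance):
    """Attend only to keys whose signature lies within max_distance bits
    of the query signature; all other sequence elements are skipped."""
    q_sig = lsh_signature(query, planes)
    # Software stand-in for a CAM threshold-match search over key signatures.
    selected = [
        i for i, k in enumerate(keys)
        if hamming(q_sig, lsh_signature(k, planes)) <= max_distance
    ]
    if not selected:
        # Fall back to dense attention if the filter rejects every key.
        selected = list(range(len(keys)))
    scale = 1.0 / math.sqrt(len(query))
    scores = [
        scale * sum(q * k for q, k in zip(query, keys[i])) for i in selected
    ]
    # Numerically stable softmax over the surviving keys only.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, selected))
        for d in range(dim)
    ]
    return selected, output
```

With `max_distance=0` only keys that hash to exactly the query's signature contribute, so the number of dot products and softmax terms shrinks with the number of filtered-out elements. Raising `max_distance` trades that saving for a closer approximation of dense attention.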

Hyft: A Reconfigurable Softmax Accelerator with Hybrid Numeric Format for both Training and Inference

ITA: An Energy-Efficient Attention and Softmax Accelerator for Quantized Transformers

HyCTor: A Hybrid CNN-Transformer Network Accelerator with Flexible Weight/Output Stationary Dataflow and Multi-Core Extension

Ayaka: A Versatile Transformer Accelerator with Low-Rank Estimation and Heterogeneous Dataflow

DTATrans: Leveraging Dynamic Token-Based Quantization with Accuracy Compensation Mechanism for Efficient Transformer Architecture

TATAA: Programmable Mixed-Precision Transformer Acceleration with a Transformable Arithmetic Architecture

Improving Transformer Inference Through Optimized Non-Linear Operations with Quantization-Approximation-Based Strategy

Hardware-Efficient SoftMax Architecture With Bit-Wise Exponentiation and Reciprocal Calculation

FACT: FFN-Attention Co-optimized Transformer Architecture with Eager Correlation Prediction

Hardware-Software Co-Design of an In-Memory Transformer Network Accelerator

An Analog and Digital Hybrid Attention Accelerator for Transformers with Charge-based In-memory Computing

Accelerating Framework of Transformer by Hardware Design and Model Compression Co-Optimization

TEA-S: A Tiny and Efficient Architecture for PLAC-Based Softmax in Transformers

ULSeq-TA: Ultra-Long Sequence Attention Fusion Transformer Accelerator Supporting Grouped Sparse Softmax and Dual-Path Sparse LayerNorm

Enhancing Long Sequence Input Processing in FPGA-Based Transformer Accelerators Through Attention Fusion

Hardware-friendly compression and hardware acceleration for transformer: A survey

Exploring Approximation and Dataflow Co-Optimization for Scalable Transformer Inference Architecture on the Edge

FAMOUS: Flexible Accelerator for the Attention Mechanism of Transformer on UltraScale+ FPGAs

FuseMax: Leveraging Extended Einsums to Optimize Attention Accelerator Design

Topkima-Former: Low-energy, Low-Latency Inference for Transformers using top-k In-memory ADC

ConSmax: Hardware-Friendly Alternative Softmax with Learnable Parameters