Abstract: Multi-head self-attention (the attention mechanism) has been employed in a variety of fields, such as machine translation, language modeling, and image processing, because of its strength in feature extraction and sequential data analysis. This strength comes from the large number of parameters and the sophisticated model architecture behind the attention mechanism. To deploy the attention mechanism efficiently on resource-constrained devices, existing works reduce the model size either by building a customized smaller model or by compressing a large standard model. A customized smaller model is usually optimized for a specific task and requires effort in exploring model parameters. Model compression reduces model size without hurting the robustness of the model architecture, so it can be applied efficiently to different tasks. The compressed weights are usually regularly shaped (e.g., rectangular), but their dimensions vary (e.g., in height and width). Such a compressed attention mechanism can be deployed efficiently on CPU/GPU platforms because their memory and computing resources can be assigned flexibly on demand. On Field Programmable Gate Arrays (FPGAs), however, the data buffer allocation and computing kernel are fixed at run time to achieve maximum energy efficiency. After compression, the weights are much smaller and differ in size, which leads to inefficient utilization of the FPGA on-chip buffer. Moreover, the differing weight heights and widths may lead to inefficient execution of the FPGA computing kernel. Because the attention mechanism has a large number of weights, building a unique buffer and computing kernel for each compressed weight on the FPGA is not feasible. In this work, we jointly consider the impact of compression on buffer allocation and on the required computing kernel while compressing the attention mechanism. We propose a novel structural pruning method with memory footprint awareness and design the associated accelerator on FPGA. The experimental results show that our work can compress the Transformer (an attention-mechanism-based model) by 95x. The developed accelerator fully utilizes the FPGA resources, processing the sparse attention mechanism with a run-time throughput of 1.87 Tops on the ZCU102 FPGA.
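
The buffer-utilization concern raised in the abstract can be illustrated with a minimal sketch. It is not the paper's method: the function name `buffer_utilization`, the tile shapes, and the assumption that a single shared buffer is sized to the largest height and width are all hypothetical, chosen only to show why weights of differing sizes waste a buffer whose dimensions are fixed at run time.

```python
def buffer_utilization(shapes, buffer_shape=None):
    """Return the fraction of one fixed buffer that each weight tile occupies.

    shapes       -- list of (height, width) pairs for compressed weights
    buffer_shape -- (height, width) of the fixed buffer; if omitted, it is
                    assumed to be just large enough for the tallest and the
                    widest tile (an illustrative assumption, not the paper's)
    """
    if buffer_shape is None:
        buffer_shape = (max(h for h, _ in shapes), max(w for _, w in shapes))
    buf_h, buf_w = buffer_shape
    capacity = buf_h * buf_w
    return [(h * w) / capacity for h, w in shapes]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Hypothetical compressed weight shapes: regular rectangles, varying sizes.
irregular = [(64, 16), (32, 48), (16, 64), (48, 8)]
# Hypothetical shapes after pruning every weight to one common footprint.
uniform = [(32, 16)] * 4

irregular_util = buffer_utilization(irregular)
uniform_util = buffer_utilization(uniform)

print("irregular mean utilization:", mean(irregular_util))
print("uniform mean utilization:  ", mean(uniform_util))
```

With these made-up shapes, the shared buffer must be 64 x 64 to hold every irregular tile, so most of it sits empty for any single tile, whereas tiles with a common footprint fill their buffer completely. A footprint-aware pruning method would presumably aim for the second situation, though the actual pruning criterion is not given in the abstract.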
