Abstract: Models based on the attention mechanism, i.e., transformers, have shown extraordinary performance in natural language processing (NLP) tasks. However, their memory footprint, inference latency, and power consumption remain prohibitive for efficient inference on edge devices and even in data centers. To tackle this issue, we present an algorithm-architecture co-design named DTATrans. We find empirically that tolerance to noise varies from token to token in attention-based NLP models. This finding leads us to quantize different tokens dynamically with mixed bit widths. Furthermore, we find that an overly strict quantization method forces a trade-off between model accuracy and compression ratio, which motivates us to explore a method that compensates for the accuracy loss when the compression ratio is high. Thus, in DTATrans, we design a compression framework that: 1) dynamically quantizes tokens while they are forwarded through the model; 2) jointly determines the ratio of each precision; and 3) compensates for the model accuracy by exploiting lightweight computing on the 0-bit tokens. Moreover, because our framework produces dynamic mixed-precision tokens, previous matrix-multiplication accelerators (e.g., systolic arrays) cannot effectively exploit the benefit of the compressed attention computation. We therefore design our transformer accelerator around a variable-speed systolic array (VSSA) and propose an effective optimization strategy that alleviates the pipeline-stall problem in VSSA without hardware overhead. We conduct experiments with existing attention-based NLP models, including BERT and GPT-2, on various language tasks. Our results show that DTATrans outperforms the previous neural network accelerator Eyeriss by 16.04x in terms of speedup and 3.62x in terms of energy saving. Compared with the state-of-the-art attention accelerator SpAtten, DTATrans achieves at least 3.62x speedup and 4.22x energy efficiency improvement.
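The idea of quantizing tokens with mixed bit widths can be illustrated with a minimal NumPy sketch. It is not the DTATrans implementation: the abstract does not say how per-token noise tolerance is measured or how the precision ratios are chosen, so the `importance` scores, the `ratios` and `levels` defaults, and the function names below are all hypothetical. Tokens assigned 0 bits are simply zeroed here; the accuracy-compensation step for 0-bit tokens is not modeled.

```python
import numpy as np

def quantize_token(x, bits):
    """Symmetric uniform quantization of one token vector to `bits` bits."""
    if bits == 0:
        return np.zeros_like(x)          # 0-bit token: dropped in this sketch
    qmax = 2 ** (bits - 1) - 1           # e.g. 127 for 8 bits, 7 for 4 bits
    max_abs = np.max(np.abs(x))
    if max_abs == 0:
        return np.zeros_like(x)
    scale = max_abs / qmax
    return np.round(x / scale).clip(-qmax, qmax) * scale

def assign_bits(importance, ratios, levels):
    """Rank tokens by importance and give the top fraction the highest precision.

    ratios[i] is the fraction of tokens that receives levels[i];
    levels must be ordered from highest to lowest precision.
    """
    importance = np.asarray(importance, dtype=float)
    n = len(importance)
    order = np.argsort(-importance, kind="stable")   # most important first
    bits = np.empty(n, dtype=int)
    bounds = np.round(np.cumsum(ratios) * n).astype(int)
    start = 0
    for level, end in zip(levels, bounds):
        bits[order[start:end]] = level
        start = end
    bits[order[start:]] = levels[-1]     # guard against rounding leftovers
    return bits

def quantize_tokens(tokens, importance, ratios=(0.25, 0.5, 0.25), levels=(8, 4, 0)):
    """Quantize each row of `tokens` with its own (assumed) bit width."""
    bits = assign_bits(importance, ratios, levels)
    out = np.stack([quantize_token(t, b) for t, b in zip(tokens, bits)])
    return out, bits
```

Calling `quantize_tokens(tokens, importance)` on an array of shape `(num_tokens, hidden_dim)` returns the dequantized tokens together with the bit width chosen for each one. Because the bit widths depend on the input, the amount of work per token varies at run time, which is the property that the abstract says a fixed-speed systolic array cannot exploit.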

TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference

Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains.

SmartBERT: A Promotion of Dynamic Early Exiting Mechanism for Accelerating BERT Inference.

DPBERT: Efficient Inference for BERT based on Dynamic Planning

Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference

Infor-Coef: Information Bottleneck-based Dynamic Token Downsampling for Compact and Efficient language model

Revisiting Token Dropping Strategy in Efficient BERT Pretraining

NarrowBERT: Accelerating Masked Language Model Pretraining and Inference

Retrofitting (Large) Language Models with Dynamic Tokenization

Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

SmartTrim: Adaptive Tokens and Attention Pruning for Efficient Vision-Language Models

Dynamic and Efficient Inference for Text Generation Via BERT Family

COST-EFF: Collaborative Optimization of Spatial and Temporal Efficiency with Slenderized Multi-exit Language Models

bert2BERT: Towards Reusable Pretrained Language Models

Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration

EarlyBERT: Efficient BERT Training Via Early-bird Lottery Tickets

Token Cropr: Faster ViTs for Quite a Few Tasks

DTATrans: Leveraging Dynamic Token-Based Quantization with Accuracy Compensation Mechanism for Efficient Transformer Architecture.

PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation

Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification

NewsBERT: Distilling Pre-trained Language Model for Intelligent News Application