Abstract: Models based on the attention mechanism, i.e., transformers, have shown extraordinary performance in natural language processing (NLP) tasks. However, their memory footprint, inference latency, and power consumption remain prohibitive for efficient inference on edge devices and even in data centers. To tackle this issue, we present an algorithm-architecture co-design named DTATrans. We find empirically that tolerance to noise varies from token to token in attention-based NLP models. This finding leads us to dynamically quantize different tokens with mixed bit-widths. Furthermore, we find that an overly strict quantization method forces a trade-off between model accuracy and compression ratio, which motivates us to explore a method that compensates for the model accuracy when the compression ratio is high. Thus, in DTATrans, we design a compression framework that: 1) dynamically quantizes tokens while they are forwarded through the model; 2) jointly determines the ratio of each precision; and 3) compensates for the model accuracy by exploiting lightweight computing on the 0-bit tokens. Moreover, because our framework produces dynamic mixed-precision tokens, previous matrix-multiplication accelerators (e.g., systolic arrays) cannot effectively exploit the benefit of the compressed attention computation. We therefore design our transformer accelerator around a variable-speed systolic array (VSSA) and propose an effective optimization strategy that alleviates the pipeline-stall problem in the VSSA without hardware overhead. We conduct experiments with existing attention-based NLP models, including BERT and GPT-2, on various language tasks. Our results show that DTATrans outperforms the previous neural network accelerator Eyeriss by 16.04x in terms of speedup and 3.62x in terms of energy saving. Compared with the state-of-the-art attention accelerator SpAtten, DTATrans achieves at least 3.62x speedup and 4.22x energy efficiency improvement.
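As a rough illustration of what token-level mixed-precision quantization could look like, the sketch below ranks tokens by a per-token score and assigns each one a bit-width according to fixed ratios. It is not the DTATrans algorithm: the abstract does not say how noise tolerance is measured or how the precision ratios are chosen, so the `scores` input, the `ratios` mapping, the symmetric uniform quantizer, and the function names `assign_bits` and `quantize_token` are all assumptions made for this example. The accuracy-compensation step for 0-bit tokens is also omitted; such tokens are simply zeroed here.

```python
def quantize_token(vec, bits):
    """Symmetric uniform quantization of one token vector to `bits` bits.

    Assumption for this sketch: a 0-bit token is replaced by zeros. The
    abstract's lightweight compensation for 0-bit tokens is not modeled.
    """
    if bits == 0:
        return [0.0] * len(vec)
    if bits < 2:
        raise ValueError("this sketch supports 0 bits or at least 2 bits")
    max_abs = max(abs(x) for x in vec)
    if max_abs == 0.0:
        return list(vec)
    levels = 2 ** (bits - 1) - 1      # e.g. 7 for 4 bits, 127 for 8 bits
    scale = max_abs / levels          # step size between quantization levels
    return [round(x / scale) * scale for x in vec]


def assign_bits(scores, ratios):
    """Give higher precision to higher-scored tokens.

    `scores` is a hypothetical per-token sensitivity measure.
    `ratios` maps bit-width -> fraction of tokens, e.g. {8: 0.25, 4: 0.5, 0: 0.25}.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    widths = sorted(ratios, reverse=True)
    bits = [0] * len(scores)
    start = 0
    for k, width in enumerate(widths):
        if k == len(widths) - 1:
            count = len(order) - start            # lowest precision takes the rest
        else:
            count = round(ratios[width] * len(order))
        for i in order[start:start + count]:
            bits[i] = width
        start += count
    return bits


# Toy usage: four tokens with two features each.
tokens = [[1.0, -0.5], [0.2, 0.1], [0.7, 0.3], [-0.4, 0.9]]
scores = [0.9, 0.1, 0.5, 0.3]
bit_widths = assign_bits(scores, {8: 0.25, 4: 0.5, 0: 0.25})
quantized = [quantize_token(t, b) for t, b in zip(tokens, bit_widths)]
```

Because each token may end up with a different bit-width, the cost of a matrix multiplication over such tokens is no longer uniform across rows, which is the kind of irregularity the abstract says a fixed-speed systolic array cannot exploit.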

OTAS: An Elastic Transformer Serving System via Token Adaptation

Adaptive Token Sampling For Efficient Vision Transformers

TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

A General and Efficient Training for Transformer via Token Expansion

In Search of Lost Online Test-time Adaptation: A Survey

Adaptive Spatial Tokenization Transformer for Salient Object Detection in Optical Remote Sensing Images.

DTATrans: Leveraging Dynamic Token-Based Quantization with Accuracy Compensation Mechanism for Efficient Transformer Architecture.

Efficient Video Transformers with Spatial-Temporal Token Selection

Orthogonal Transformer: An Efficient Vision Transformer Backbone with Token Orthogonalization

Token Shift Transformer for Video Classification

Vision Transformer with Super Token Sampling

Adaptive Token Selection for Efficient Detection Transformer with Dual Teacher Supervision

Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning

TEAFormers: TEnsor-Augmented Transformers for Multi-Dimensional Time Series Forecasting

Token Transformer: Can class token help window-based transformer build better long-range interactions?

Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for Few-shot Learning

Remote Sensing Scene Classification via Second-Order Differentiable Token Transformer Network

TTST: A Top-k Token Selective Transformer for Remote Sensing Image Super-Resolution

ToSA: Token Selective Attention for Efficient Vision Transformers

TSVT: Token Sparsification Vision Transformer for Robust RGB-D Salient Object Detection

Qihoo-T2X: An Efficient Proxy-Tokenized Diffusion Transformer for Text-to-Any-Task