Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers

Sanghyeok Lee, Joonmyung Choi, Hyunwoo J. Kim

2024-04-01

Abstract: Vision Transformer (ViT) has emerged as a prominent backbone for computer vision. For more efficient ViTs, recent works lessen the quadratic cost of the self-attention layer by pruning or fusing redundant tokens. However, these works face a speed-accuracy trade-off caused by the loss of information. Here, we argue that token fusion needs to consider diverse relations between tokens to minimize information loss. In this paper, we propose Multi-criteria Token Fusion (MCTF), which gradually fuses tokens based on multiple criteria (e.g., similarity, informativeness, and size of fused tokens). Further, we utilize one-step-ahead attention, an improved approach to capturing the informativeness of tokens. By training the model equipped with MCTF using token reduction consistency, we achieve the best speed-accuracy trade-off in image classification (ImageNet1K). Experimental results show that MCTF consistently surpasses previous reduction methods with and without training. Specifically, DeiT-T and DeiT-S with MCTF reduce FLOPs by about 44% while improving performance (+0.5% and +0.3%, respectively) over the base model. We also demonstrate the applicability of MCTF to various Vision Transformers (e.g., T2T-ViT, LV-ViT), achieving at least 31% speedup without performance degradation. Code is available at

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the speed-accuracy trade-off of token reduction in Vision Transformers (ViTs): pruning or fusing tokens lowers the quadratic cost of self-attention, but the information lost in the process hurts accuracy. To minimize that loss, the paper proposes Multi-criteria Token Fusion (MCTF), which decides which tokens to fuse from several relations between them (similarity, informativeness, and the size of already-fused tokens) rather than from a single criterion. It also uses one-step-ahead attention to measure token informativeness more accurately, and trains the MCTF-equipped model with a token reduction consistency objective.

In the experiments on ImageNet1K, MCTF outperforms previous token reduction methods both with and without training. On DeiT-T it reduces FLOPs by about 44% while increasing accuracy by 0.5%; on DeiT-S it reduces FLOPs by about 44% while increasing accuracy by 0.3%. On other ViT variants, such as T2T-ViT and LV-ViT, it achieves at least a 31% speedup without performance degradation.
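
To make the idea of fusing on several criteria at once concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`attraction`, `fuse_tokens`), the way the three criteria are multiplied together, the greedy pair selection, and the size-weighted averaging are all assumptions chosen for illustration. The informativeness scores are simply passed in as numbers in [0, 1]; how they would be obtained from one-step-ahead attention is outside the scope of the sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def attraction(tokens, info, sizes, i, j):
    """Hypothetical multi-criteria score: high when tokens i and j are
    similar, both uninformative, and both still small."""
    similar = 0.5 * (cosine(tokens[i], tokens[j]) + 1.0)  # mapped to [0, 1]
    uninformative = (1.0 - info[i]) * (1.0 - info[j])     # assumed form
    small = 1.0 / (sizes[i] * sizes[j])                   # assumed form
    return similar * uninformative * small


def fuse_tokens(tokens, info, sizes, r):
    """Greedily fuse up to r disjoint token pairs with the highest attraction.

    tokens: list of feature vectors, info: informativeness in [0, 1],
    sizes: number of original tokens each entry already represents.
    Returns the reduced (tokens, info, sizes).
    """
    n = len(tokens)
    scored = sorted(
        ((attraction(tokens, info, sizes, i, j), i, j)
         for i in range(n) for j in range(i + 1, n)),
        reverse=True,
    )
    used, pairs = set(), []
    for _, i, j in scored:
        if len(pairs) == r:
            break
        if i in used or j in used:
            continue  # each token takes part in at most one fusion
        used.update((i, j))
        pairs.append((i, j))

    # Unfused tokens are kept as they are.
    keep = [k for k in range(n) if k not in used]
    out_tokens = [list(tokens[k]) for k in keep]
    out_info = [info[k] for k in keep]
    out_sizes = [sizes[k] for k in keep]

    # Fused tokens are size-weighted averages; sizes add up.
    for i, j in pairs:
        total = sizes[i] + sizes[j]
        out_tokens.append([(sizes[i] * a + sizes[j] * b) / total
                           for a, b in zip(tokens[i], tokens[j])])
        out_info.append((sizes[i] * info[i] + sizes[j] * info[j]) / total)
        out_sizes.append(total)
    return out_tokens, out_info, out_sizes


if __name__ == "__main__":
    toks = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [-1.0, 0.0]]
    new_toks, new_info, new_sizes = fuse_tokens(
        toks, info=[0.1, 0.1, 0.9, 0.5], sizes=[1, 1, 1, 1], r=1)
    print(len(new_toks), new_sizes)
```

In this toy input the first two tokens are nearly identical and both have low informativeness, so they are the pair selected for fusion, while the highly informative third token is left untouched. Tracking sizes lets later fusion steps avoid repeatedly growing one very large token, which is the motivation the summary gives for the size criterion.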
