Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers

Sanghyeok Lee, Joonmyung Choi, Hyunwoo J. Kim
2024-04-01
Abstract: Vision Transformer (ViT) has emerged as a prominent backbone for computer vision. For more efficient ViTs, recent works lessen the quadratic cost of the self-attention layer by pruning or fusing redundant tokens. However, these works face a speed-accuracy trade-off caused by the loss of information. Here, we argue that token fusion needs to consider diverse relations between tokens to minimize information loss. In this paper, we propose Multi-criteria Token Fusion (MCTF), which gradually fuses tokens based on multiple criteria (e.g., similarity, informativeness, and size of fused tokens). Further, we utilize one-step-ahead attention, an improved approach to capturing the informativeness of tokens. By training the model equipped with MCTF using token reduction consistency, we achieve the best speed-accuracy trade-off in image classification (ImageNet1K). Experimental results show that MCTF consistently surpasses previous reduction methods with and without training. Specifically, DeiT-T and DeiT-S with MCTF reduce FLOPs by about 44% while improving performance over the base model (+0.5% and +0.3%, respectively). We also demonstrate the applicability of MCTF to various Vision Transformers (e.g., T2T-ViT, LV-ViT), achieving at least 31% speedup without performance degradation. Code is available at
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the speed-accuracy trade-off of token reduction in Vision Transformers (ViTs): pruning or fusing tokens lowers the cost of self-attention, but the information lost in the process hurts accuracy. It proposes Multi-criteria Token Fusion (MCTF), which decides which tokens to fuse by weighing several relations between them (similarity, informativeness, and the size of already-fused tokens) to minimize that loss. The method also uses one-step-ahead attention to measure token informativeness more reliably, and a fine-tuning scheme based on token reduction consistency to improve the model's generalization. In the experiments, MCTF reduces computation while improving classification accuracy: on DeiT-T it cuts FLOPs by about 44% and raises accuracy by 0.5%, and on DeiT-S it cuts FLOPs by about 44% and raises accuracy by 0.3%. On other ViT variants (such as T2T-ViT and LV-ViT), MCTF achieves at least a 31% speedup without performance degradation.
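To make the idea of fusing tokens by several criteria at once more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names (`attraction`, `fuse_tokens`), the exact form of each criterion, the product used to combine them, and the greedy best-pair search are all hypothetical choices made for readability. Informativeness scores are taken as given inputs in [0, 1]; how they would be derived from attention is not modeled here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two token vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def attraction(x_i, x_j, info_i, info_j, size_i, size_j):
    """Hypothetical multi-criteria score; higher means 'fuse this pair first'."""
    w_sim = 0.5 * (cosine(x_i, x_j) + 1.0)      # similar tokens attract (range 0..1)
    w_info = (1.0 - info_i) * (1.0 - info_j)    # assumed: uninformative tokens attract
    w_size = 1.0 / (size_i * size_j)            # assumed: large fused tokens are penalized
    return w_sim * w_info * w_size


def fuse_tokens(tokens, info, sizes, num_fusions):
    """Greedily fuse the highest-scoring pair, num_fusions times.

    The greedy all-pairs search is a simplification chosen for this sketch.
    """
    tokens = [list(t) for t in tokens]
    info = list(info)
    sizes = list(sizes)
    for _ in range(num_fusions):
        if len(tokens) < 2:
            break
        best = None
        for i in range(len(tokens)):
            for j in range(i + 1, len(tokens)):
                w = attraction(tokens[i], tokens[j], info[i], info[j],
                               sizes[i], sizes[j])
                if best is None or w > best[0]:
                    best = (w, i, j)
        _, i, j = best
        total = sizes[i] + sizes[j]
        # Size-weighted average, so a token that already represents many
        # patches contributes proportionally more to the fused token.
        tokens[i] = [(sizes[i] * a + sizes[j] * b) / total
                     for a, b in zip(tokens[i], tokens[j])]
        info[i] = (sizes[i] * info[i] + sizes[j] * info[j]) / total
        sizes[i] = total
        del tokens[j], info[j], sizes[j]
    return tokens, info, sizes
```

For example, calling `fuse_tokens([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], [0.1, 0.1, 0.9], [1, 1, 1], 1)` merges the two identical, low-informativeness tokens and leaves the distinct, informative one untouched. The size term is what keeps a single token from absorbing everything: once a token has size 2, its score against any other token is halved relative to a fresh token.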