Token Selection is a Simple Booster for Vision Transformers

Daquan Zhou, Qibin Hou, Linjie Yang, Xiaojie Jin, Jiashi Feng
DOI: https://doi.org/10.1109/tpami.2022.3208922
IF: 23.6
2022-01-01
IEEE Transactions on Pattern Analysis and Machine Intelligence
Abstract: Vision transformers (ViTs) have recently attained state-of-the-art results in visual recognition tasks. Their success is largely attributed to the self-attention component, which models the global dependencies among the image patches (tokens) and aggregates them into higher-level features. However, self-attention brings significant training difficulties to ViTs. Many recent works thus develop various new self-attention components to alleviate this issue. In this work, instead of developing complicated self-attention mechanisms, we aim to explore simple approaches to fully release the potential of the vanilla self-attention. We first study the token selection behavior of self-attention and find that it suffers from low diversity due to attention over-smoothing, which severely limits its effectiveness in learning discriminative token features. We then develop simple approaches to enhance selectivity and diversity for self-attention in token selection. The resulting token selector module can serve as a drop-in module for various ViT backbones and consistently boosts their performance. Significantly, it enables ViTs to achieve 84.6% top-1 classification accuracy on ImageNet with only 25 M parameters. When scaled up to 81 M parameters, the result can be further improved to 86.1%. In addition, we also present comprehensive experiments to demonstrate that the token selector can be applied to a variety of transformer-based models to boost their performance for image classification, semantic segmentation and NLP tasks. Code is available at https://github.com/zhoudaquan/dvit_repo.
computer science, artificial intelligence; engineering, electrical & electronic