An Attention-Based Token Pruning Method for Vision Transformers

Kaicheng Luo, Huaxiong Li, Xianzhong Zhou, Bing Huang
DOI: https://doi.org/10.1007/978-3-031-21244-4_21
2022-01-01
Abstract: Recently, vision transformers have achieved impressive success in computer vision tasks. Nevertheless, these models incur heavy computational cost because of the quadratic complexity of the self-attention mechanism, especially when dealing with high-resolution images. Previous literature has demonstrated the sparsity of attention, which suggests that uninformative tokens can be discarded to accelerate the model with limited impact on accuracy. As a natural indicator of token importance, attention scores can be used to extract the discriminative regions in images. Inspired by these facts, we propose an attention-based token pruning framework to address the inefficiency of vision transformers. We divide the transformer blocks in the model into pruning stages, where the attention weights of the multiple attention heads are fused to estimate the importance of each token. The computational cost of the model is reduced by progressively dropping redundant patches after each pruning stage. Experiments conducted on ImageNet1k verify the effectiveness of our method: the models pruned by our module outperform other state-of-the-art models with similar FLOPs. For fine-grained image recognition, our framework also improves both the accuracy and the efficiency of ViT on CUB200-2011. More significantly, the proposed attention-based pruning module can be plugged into any vision transformer that contains a class token with only 10 epochs, or even a single epoch, of fine-tuning, achieving a reasonable trade-off between accuracy and cost.
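The pruning step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the heads are fused, so the example assumes a plain average of the class token's attention over all heads, and the function name `prune_tokens`, the `keep_ratio` parameter, and the tensor layout (class token at index 0) are hypothetical choices made only for illustration.

```python
import numpy as np


def prune_tokens(tokens, attn, keep_ratio=0.5):
    """Keep the patches that receive the most class-token attention.

    tokens: (N + 1, D) array, class token at index 0 (assumed layout).
    attn:   (H, N + 1, N + 1) attention weights, one matrix per head.
    Returns the kept tokens (class token first) and the kept patch indices.
    """
    # Attention paid by the class token to each patch, per head: (H, N).
    cls_attn = attn[:, 0, 1:]
    # Assumption: fuse the heads by simple averaging.
    scores = cls_attn.mean(axis=0)
    num_patches = scores.shape[0]
    num_keep = max(1, int(np.ceil(keep_ratio * num_patches)))
    # Pick the highest-scoring patches, then restore their original order.
    top = np.argsort(-scores, kind="stable")[:num_keep]
    keep = np.sort(top)
    # Always retain the class token; patch i lives at row i + 1.
    kept = np.concatenate([tokens[:1], tokens[1 + keep]], axis=0)
    return kept, keep


# Usage with random data: 3 heads, 8 patches, 4-dimensional embeddings.
rng = np.random.default_rng(0)
H, N, D = 3, 8, 4
tokens = rng.normal(size=(N + 1, D))
logits = rng.normal(size=(H, N + 1, N + 1))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

kept, keep_idx = prune_tokens(tokens, attn, keep_ratio=0.5)
print(kept.shape)  # (5, 4): the class token plus 4 of the 8 patches
```

Applying such a step after each pruning stage shrinks the sequence progressively, so later blocks compute self-attention over fewer tokens; because self-attention cost grows quadratically with sequence length, halving the patch count reduces that term by roughly a factor of four.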