Abstract: Vision Transformers (ViTs) have recently achieved promising results in various computer vision tasks. However, ViTs have high computation costs and a large number of parameters due to the stacked multi-head self-attention (MHSA) and expanded feed-forward network (FFN) modules. Since the complexity of Transformer-based models is quadratic in the number of input tokens, most current efforts focus on reducing the number of tokens in ViTs to improve model efficiency. Unlike previous studies, we argue that diverse redundant features help ViTs understand the data comprehensively. In this paper, we propose GhostViT, which achieves both computation and storage efficiency. The key concept of GhostViT is to generate more diverse features using cheap operations in the MHSA and FFN modules. We experimentally demonstrate that GhostViT can significantly reduce both the parameters and FLOPs of ViTs while achieving similar or better accuracy. For example, GhostViT removes about 14% of the parameters and 17% of the FLOPs of the DeiT-tiny model without any accuracy loss on the ImageNet-1K dataset. The code and trained models can be found at https://github.com/HuCaoFighting/GhostViT .
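To make the idea of "cheap operations" concrete, the following is a minimal parameter-counting sketch, not the GhostViT implementation. It assumes a hypothetical ghost-style linear layer in which a dense projection produces only a fraction of the output features and the remaining features are derived from them by a per-channel scale and shift. The function names, the `ratio` parameter, the choice of cheap operation, and the layer sizes (192 inputs, 768 outputs) are all illustrative assumptions; the abstract does not specify them, and the sketch is not expected to reproduce the reported 14% and 17% reductions.

```python
def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameter count of a standard dense (fully connected) layer."""
    return d_in * d_out + (d_out if bias else 0)


def ghost_linear_params(d_in: int, d_out: int, ratio: int = 2,
                        bias: bool = True) -> int:
    """Parameter count of a hypothetical ghost-style linear layer.

    Assumption: a dense layer produces d_out // ratio "primary" features,
    and every remaining feature is generated from a primary feature by a
    cheap per-channel scale and shift (2 parameters per generated feature).
    """
    if d_out % ratio != 0:
        raise ValueError("d_out must be divisible by ratio")
    primary = d_out // ratio      # features computed by the dense layer
    generated = d_out - primary   # features computed by the cheap operation
    return linear_params(d_in, primary, bias) + 2 * generated


if __name__ == "__main__":
    # Illustrative sizes only (assumed, not taken from the paper).
    d_in, d_out = 192, 768
    dense = linear_params(d_in, d_out)
    ghost = ghost_linear_params(d_in, d_out, ratio=2)
    print(f"dense layer parameters: {dense}")
    print(f"ghost layer parameters: {ghost}")
    print(f"fraction saved in this layer: {1 - ghost / dense:.3f}")
```

Under these assumptions the saving comes from shrinking the dense projection, since the per-channel operation adds only a number of parameters linear in the output width. The overall reduction for a full model would depend on which layers are replaced and on the actual cheap operation used.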

GhostViT: Expediting Vision Transformers Via Cheap Operations