Abstract: Vision Transformer (ViT) has performed remarkably well in various computer vision tasks. Nonetheless, because of its large number of parameters, ViT usually suffers from serious overfitting when the number of training samples is relatively limited. In addition, ViT generally demands heavy computing resources, which limits its deployment on resource-constrained devices. As a type of model-compression method, model binarization is potentially a good choice to solve these problems. Compared with its full-precision counterpart, a binarized model replaces complex tensor multiplication with simple bit-wise binary operations and represents full-precision parameters and activations with 1-bit ones, which potentially addresses computational complexity and model size, respectively. In this paper, we investigate a binarized ViT model. Empirically, we observe that existing binarization techniques designed for Convolutional Neural Networks (CNNs) do not transfer well to the ViT binarization task. We also find that the accuracy decline of the binary ViT model is mainly due to the information loss of the Attention module and the Value vector. Therefore, we propose a novel model binarization technique, called Group Superposition Binarization (GSB), to deal with these issues. To further improve the performance of the binarized model, we investigate the gradient calculation procedure in the binarization process and derive more appropriate gradient calculation equations for GSB to reduce the influence of gradient mismatch. Knowledge distillation is then introduced to alleviate the performance degradation caused by model binarization. Analytically, model binarization limits the parameter search space during parameter updates in training. Therefore, the binarization process can play an implicit regularization role and help mitigate overfitting when training data are insufficient. Experiments on three datasets with limited numbers of training samples demonstrate that the proposed GSB model achieves state-of-the-art performance among binary quantization schemes and exceeds its full-precision counterpart on some indicators. Code and models are available at: https://github.com/IMRL/GSB-Vision-Transformer.
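
To make the abstract's two general claims concrete (1-bit values replace full-precision ones, and bit-wise operations replace multiplication), the following is a minimal sketch of plain sign binarization in NumPy. It is not the GSB method proposed in the paper. The function names (`binarize`, `pack_bits`, `binary_dot`, `ste_grad`), the mean-absolute-value scaling factor, and the clipped straight-through gradient are illustrative assumptions taken from common binary-network practice; that gradient is only a rough surrogate, which is the kind of gradient mismatch the abstract refers to.

```python
import numpy as np

def binarize(x):
    """Map a real-valued array to {-1, +1} plus one scaling factor.

    Assumption: the scale is the mean absolute value of x, a common
    choice in binary networks; it is not taken from the GSB paper.
    """
    alpha = float(np.mean(np.abs(x)))
    b = np.where(x >= 0, 1, -1).astype(np.int8)
    return alpha, b

def pack_bits(b):
    """Encode a {-1, +1} vector as a Python int: bit i is 1 where b[i] == +1."""
    word = 0
    for i, v in enumerate(b):
        if v == 1:
            word |= 1 << i
    return word

def binary_dot(word_a, word_b, n):
    """Dot product of two packed {-1, +1} vectors of length n.

    XOR marks the positions where the signs differ; each mismatch
    contributes -1 and each match +1, so the result is n - 2 * mismatches.
    """
    mismatches = bin(word_a ^ word_b).count("1")
    return n - 2 * mismatches

def ste_grad(x, upstream, clip=1.0):
    """Clipped straight-through estimator for the sign function.

    The true derivative of sign(x) is zero almost everywhere, so the
    upstream gradient is passed through where |x| <= clip and zeroed
    elsewhere.
    """
    return upstream * (np.abs(x) <= clip)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x, y = rng.standard_normal(64), rng.standard_normal(64)
    ax, bx = binarize(x)
    ay, by = binarize(y)
    approx = ax * ay * binary_dot(pack_bits(bx), pack_bits(by), 64)
    print("full-precision dot:", float(x @ y))
    print("binarized estimate:", approx)
```

The gap between the two printed values illustrates the information loss that binarization introduces. The abstract attributes the accuracy drop in binary ViT mainly to such loss in the Attention module and the Value vector.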

SI-BiViT: Binarizing Vision Transformers with Spatial Interaction

BiViT: Extremely Compressed Binary Vision Transformers

GSB: Group superposition binarization for vision transformer with limited training samples

SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation

Bi-ViT: Pushing the Limit of Vision Transformer Quantization

SAViT: Structure-Aware Vision Transformer Pruning Via Collaborative Optimization

BinaryViT: Pushing Binary Vision Transformers Towards Convolutional Models

SepViT: Separable Vision Transformer

BViT: Broad Attention based Vision Transformer

SegViT: Semantic Segmentation with Plain Vision Transformers

Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer

Making Vision Transformers Efficient from A Token Sparsification View

ViR: the Vision Reservoir

Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation

SPViT: Enabling Faster Vision Transformers Via Latency-Aware Soft Token Pruning

Super Vision Transformer

Vision Transformer with Sparse Scan Prior

SegViTv2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers

A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation

SimViT: Exploring a Simple Vision Transformer with sliding windows

BSI-MVS: multi-view stereo network with bidirectional semantic information