SI-BiViT: Binarizing Vision Transformers with Spatial Interaction

Peng Yin, Xiaosu Zhu, Jingkuan Song, Lianli Gao, Heng Tao Shen
DOI: https://doi.org/10.1145/3664647.3680872
2024-01-01
Abstract: Binarized Vision Transformers (BiViTs) aim to make Vision Transformers (ViTs) efficient and lightweight enough for devices with limited computational resources. Yet current approaches to binarizing ViTs cause a substantial performance drop relative to the full-precision model, which hinders practical deployment. Through empirical study, we reveal that spatial interaction (SI) is a critical factor affecting performance, owing to a lack of token-level correlation, but previous work ignores this factor. To this end, we design a ViT binarization approach, dubbed SI-BiViT, that incorporates spatial interaction into the binarization process. Specifically, an SI module is placed alongside the Multi-Layer Perceptron (MLP) module to form a dual-branch structure. This structure not only leverages knowledge from pre-trained ViTs by distilling over the original MLP, but also enhances spatial interaction via the introduced SI module. Correspondingly, we design a decoupled training strategy to train these two branches more effectively. Importantly, SI-BiViT is orthogonal to existing binarized ViT approaches and can be plugged directly into them. Extensive experiments demonstrate the flexibility and effectiveness of SI-BiViT by plugging our method into four classic ViT backbones across three downstream tasks: classification, detection, and segmentation. In particular, SI-BiViT improves the classification performance of binarized ViTs by an average of 10.52% in Top-1 accuracy compared to the previous state of the art. Code is available at https://github.com/VL-Group/SI-BiViT
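The dual-branch idea in the abstract can be illustrated with a toy sketch. The abstract does not specify how the SI module is built, so everything below is an assumption made for illustration: the SI branch is modeled as a binarized token-mixing product, the MLP branch as a single binarized channel-mixing product, and the two outputs are added to a residual. The names `mlp_branch`, `si_branch`, `dual_branch_block`, `w_channel`, and `w_token` are hypothetical and are not taken from the SI-BiViT code base. Distillation and the decoupled training strategy are not modeled.

```python
def sign(x):
    # Binarize a scalar to {-1, +1}; zero maps to +1 by convention here.
    return 1.0 if x >= 0 else -1.0


def binarize(matrix):
    # Element-wise sign binarization of a list-of-lists matrix.
    return [[sign(v) for v in row] for row in matrix]


def matmul(a, b):
    # (n x k) @ (k x m) -> (n x m), plain Python lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def mlp_branch(tokens, w_channel):
    # Channel mixing: each token (row) is transformed independently.
    # tokens: N x D, w_channel: D x D. One layer only, as a simplification.
    return matmul(binarize(tokens), binarize(w_channel))


def si_branch(tokens, w_token):
    # ASSUMPTION: spatial interaction modeled as mixing across tokens.
    # w_token: N x N, so every output row depends on all input tokens.
    return matmul(binarize(w_token), binarize(tokens))


def dual_branch_block(tokens, w_channel, w_token):
    # Residual sum of the input and both branch outputs (assumed combination).
    mlp_out = mlp_branch(tokens, w_channel)
    si_out = si_branch(tokens, w_token)
    return [
        [t + m + s for t, m, s in zip(t_row, m_row, s_row)]
        for t_row, m_row, s_row in zip(tokens, mlp_out, si_out)
    ]


if __name__ == "__main__":
    tokens = [[0.5, -1.0], [-0.2, 0.3], [1.5, 2.0]]  # 3 tokens, 2 channels
    w_channel = [[0.1, -0.4], [0.7, 0.2]]
    w_token = [[1.0, 1.0, 1.0], [1.0, -1.0, 1.0], [-1.0, 1.0, 1.0]]
    print(dual_branch_block(tokens, w_channel, w_token))
```

In this toy setting, changing one token leaves the other rows of the `mlp_branch` output untouched, while the `si_branch` output for every token can change. That is the kind of token-level correlation a purely per-token binarized MLP cannot provide.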