CRViT: Vision transformer advanced by causality and inductive bias for image recognition

Faming Lu, Kunhao Jia, Xue Zhang, Lin Sun
DOI: https://doi.org/10.1007/s10489-024-05910-3
IF: 5.3
2024-12-04
Applied Intelligence
Abstract: Vision Transformer (ViT) has shown strong potential in various vision tasks by exploiting the Transformer's self-attention mechanism and global perception capability. However, to train its large number of network parameters, ViT requires large amounts of data and computational resources, and it therefore performs poorly on small and medium-sized datasets. Compared to ViT, convolutional networks maintain high accuracy even with little data because they build in inductive bias (IB). In addition, causal relationships can reveal the underlying correlations in data structures, making deep learning networks more intelligent. In this work, we propose a Causal Relationship Vision Transformer (CRViT), which refines ViT by fusing causal relationships and IB. To introduce causal relationships into the network, we propose a random Fourier features module that makes feature vectors independent of each other, and we use convolution to learn the correct correlations between feature vectors and extract causal features. The convolutional downsampling structure significantly reduces the number of model parameters while introducing IB. Experimental validation underscores the data efficiency of CRViT, which achieves a Top-1 accuracy of 80.6% on the ImageNet-1k dataset. This surpasses the ViT benchmark by 2.7% while reducing parameters by 92%. The improved performance is consistent across smaller datasets, including T-ImageNet, CIFAR, and SVHN. We also create a counterfactual dataset, Colorful MNIST, and demonstrate experimentally that causality is genuinely incorporated into the model.
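The abstract names a random Fourier features module but gives no details of it. As a rough illustration only, the sketch below shows the standard random Fourier feature map for approximating an RBF kernel; it is not the authors' module. The function name `random_fourier_features` and the parameters `num_features`, `gamma`, and `seed` are assumptions introduced here for illustration.

```python
import numpy as np


def random_fourier_features(x, num_features=256, gamma=1.0, seed=0):
    """Map x of shape (n, d) to random Fourier features of shape (n, num_features).

    The inner product of two mapped vectors approximates the RBF kernel
    k(a, b) = exp(-gamma * ||a - b||^2). This is the generic construction,
    assumed here for illustration; the paper's module may differ.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # Frequencies are drawn from N(0, 2 * gamma * I), the spectral density of the kernel.
    w = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(d, num_features))
    # Random phases, uniform on [0, 2*pi).
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(x @ w + b)


def rbf_kernel(x, gamma=1.0):
    """Exact RBF kernel matrix, used only to check the approximation."""
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)
```

With enough random features, `z @ z.T` for `z = random_fourier_features(x)` should be close to `rbf_kernel(x)`. How such features are then used to make feature vectors independent is specific to the paper and is not reproduced here.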
computer science, artificial intelligence