Abstract: While models derived from Vision Transformers (ViTs) have been phenomenally surging, pre-trained models cannot seamlessly adapt to images of arbitrary resolution without altering the architecture and configuration, such as sampling the positional encoding, which limits their flexibility for various vision tasks. For instance, the Segment Anything Model (SAM) based on ViT-Huge requires all input images to be resized to 1024$\times$1024. To overcome this limitation, we propose the Multi-Head Self-Attention Convolution (MSA-Conv), which incorporates self-attention within generalized convolutions, including standard, dilated, and depthwise ones. MSA-Conv enables transformers to handle images of varying sizes without retraining or rescaling, and it further reduces computational cost compared to the global attention in ViT, which grows costly as image size increases. We then present the Vision Transformer in Convolution (TiC) as a proof of concept for image classification with MSA-Conv, in which two capacity-enhancing strategies, namely the Multi-Directional Cyclic Shifted Mechanism and the Inter-Pooling Mechanism, are proposed to establish long-distance connections between tokens and enlarge the effective receptive field. Extensive experiments have been carried out to validate the overall effectiveness of TiC. Additionally, ablation studies confirm the performance improvements made by MSA-Conv and by each of the two capacity-enhancing strategies separately. Note that our proposal aims to study an alternative to the global attention used in ViT, and MSA-Conv meets this goal by making TiC comparable to the state of the art on ImageNet-1K. Code will be released at https://github.com/zs670980918/MSA-Conv.
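
To make the core idea concrete, the following is a minimal sketch of self-attention restricted to a convolution-style window, so that the same operator applies to token grids of any height and width. It is not the authors' MSA-Conv implementation: the function name `local_self_attention`, the single head, the identity query/key/value projections, and the handling of borders by simply dropping out-of-range neighbors are all assumptions made for brevity.

```python
import math


def local_self_attention(tokens, kernel=3, dilation=1):
    """Single-head self-attention inside a kernel x kernel window.

    tokens: H x W grid (nested lists) of d-dimensional feature vectors.
    Assumptions (not from the paper): identity Q/K/V projections, one head,
    and out-of-range neighbors are skipped instead of padded.
    """
    h, w = len(tokens), len(tokens[0])
    d = len(tokens[0][0])
    r = kernel // 2
    out = [[None] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            q = tokens[i][j]
            # Gather the (optionally dilated) neighborhood, as a convolution would.
            neigh = []
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    y, x = i + di * dilation, j + dj * dilation
                    if 0 <= y < h and 0 <= x < w:
                        neigh.append(tokens[y][x])
            # Scaled dot-product scores against the neighbors only.
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in neigh]
            m = max(scores)  # subtract the max for a numerically stable softmax
            weights = [math.exp(s - m) for s in scores]
            z = sum(weights)
            # Output is the attention-weighted average of the neighbor values.
            out[i][j] = [
                sum(wt * v[c] for wt, v in zip(weights, neigh)) / z for c in range(d)
            ]
    return out


# The same function runs unchanged on grids of different sizes.
small = [[[float(i), float(j)] for j in range(4)] for i in range(4)]
large = [[[float(i), float(j)] for j in range(7)] for i in range(5)]
print(len(local_self_attention(small)), len(local_self_attention(large)[0]))
```

Because each token attends only to its window, the cost in this sketch scales as the number of tokens times the window size, rather than quadratically in the number of tokens as with global attention, and no positional encoding tied to a fixed grid size is involved.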

Constituent Attention for Vision Transformers

Scratching Visual Transformer's Back with Uniform Attention

TiC: Exploring Vision Transformer in Convolution

Advancing Vision Transformers with Group-Mix Attention

How Does Attention Work in Vision Transformers? A Visual Analytics Attempt

ASAFormer: Visual tracking with convolutional vision transformer and asymmetric selective attention

You Only Need Less Attention at Each Stage in Vision Transformers

VSA: Learning Varied-Size Window Attention in Vision Transformers

ScalableViT: Rethinking the Context-Oriented Generalization of Vision Transformer

Improving Vision Transformers by Overlapping Heads in Multi-Head Self-Attention

Dual Path Transformer with Partition Attention

FAM: Improving columnar vision transformer with feature attention mechanism

Accumulated Trivial Attention Matters in Vision Transformers on Small Datasets

Multi-Scale And Token Mergence: Make Your ViT More Efficient

Vision Transformer with Attention Map Hallucination and FFN Compaction

MaxViT: Multi-Axis Vision Transformer

Vision Big Bird: Random Sparsification for Full Attention

Fusion of regional and sparse attention in Vision Transformers

Vicinity Vision Transformer

A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition