Abstract: Transformer-based networks have driven significant progress on visual tasks. However, the widespread adoption of Vision Transformers (ViTs) is limited by their high computational and parameter requirements, which make them less feasible for resource-constrained mobile and edge computing devices. Moreover, existing lightweight ViTs are limited in capturing features at different granularities, extracting local features efficiently, and incorporating the inductive bias inherent in convolutional neural networks, which hurts their overall performance. To address these limitations, we propose an efficient ViT called Dual-Granularity Former (DGFormer). DGFormer introduces two modules: Dual-Granularity Attention (DG Attention) and an Efficient Feed-Forward Network (Efficient FFN). In our experiments on ImageNet image recognition, DGFormer surpasses lightweight models such as PVTv2-B0 and Swin Transformer by 2.3% in Top-1 accuracy. On COCO object detection under the RetinaNet framework, DGFormer outperforms PVTv2-B0 and Swin Transformer by 0.5% and 2.4% in average precision (AP), respectively. Similarly, under the Mask R-CNN framework, DGFormer improves AP by 0.4% and 1.8% over PVTv2-B0 and Swin Transformer, respectively. On ADE20K semantic segmentation, DGFormer improves mean Intersection over Union (mIoU) by 2.0% and 2.5% over PVTv2-B0 and Swin Transformer, respectively. The code is open-source and available at: https://github.com/ISCLab-Bistu/DGFormer.git.

Dual-Dependency Attention Transformer for Fine-Grained Visual Classification

Dual Transformer with Multi-Grained Assembly for Fine-Grained Visual Classification

AA-Trans: Core Attention Aggregating Transformer with Information Entropy Selector for Fine-grained Visual Classification

Dual Path Transformer with Partition Attention

Constituent Attention for Vision Transformers

Attention-based Multi-scale ViT Fine-grained Visual Classification

CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion

A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition

A novel dual-granularity lightweight transformer for vision tasks

LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition

Data Augmentation Vision Transformer for Fine-grained Image Classification

RegionViT: Regional-to-Local Attention for Vision Transformers

TransFG: A Transformer Architecture for Fine-Grained Recognition

DAT++: Spatially Dynamic Vision Transformer with Deformable Attention

Fusion of regional and sparse attention in Vision Transformers

Visual Dependency Transformers: Dependency Tree Emerges from Reversed Attention

Vicinity Vision Transformer

ASAFormer: Visual tracking with convolutional vision transformer and asymmetric selective attention
