Abstract: Image classification is a computer vision task where a model analyzes an image to categorize it into a specific label. Vision Transformers (ViT) improve this task by leveraging self-attention to capture complex patterns and long-range relationships between image patches. However, a key challenge for ViTs is efficiently incorporating multi-scale feature representations, which are inherent in CNNs through their hierarchical structure. In this paper, we introduce the Scale-Aware Graph Attention Vision Transformer (SAG-ViT), a novel framework that addresses this challenge by integrating multi-scale features. Using EfficientNet as a backbone, the model extracts multi-scale feature maps, which are divided into patches to preserve semantic information. These patches are organized into a graph based on spatial and feature similarities, with a Graph Attention Network (GAT) refining the node embeddings. Finally, a Transformer encoder captures long-range dependencies and complex interactions. SAG-ViT is evaluated on benchmark datasets, demonstrating its effectiveness in enhancing image classification performance.

What problem does this paper attempt to address?

The main problem this paper addresses is the difficulty of effectively integrating multi-scale feature representations into the Vision Transformer (ViT) for image classification. Specifically:

1. **Limitations of ViT**:
   - ViT captures global dependencies in images through the self-attention mechanism, but its fixed-size patch tokenization can miss fine-grained local details.
   - ViT usually requires large-scale datasets to train effectively, which limits its use on smaller datasets.
2. **Importance of multi-scale features**:
   - Multi-scale feature representations are crucial for ViT performance across visual tasks. They capture objects and patterns at different scales, giving a more comprehensive understanding of image content.
   - Although Convolutional Neural Networks (CNNs) naturally capture multi-scale features through their hierarchical structure, efficiently integrating this ability into Transformer-based models remains a challenge.

To address these problems, the paper proposes a new framework, the Scale-Aware Graph Attention Vision Transformer (SAG-ViT). Its main innovations are:

- **High-fidelity feature map patching strategy**: Patches are cut from the multi-scale feature maps extracted by a pre-trained EfficientNet backbone, rather than from the raw image, which retains rich semantic information.
- **Graph construction based on k-connectivity and similarity**: Patches are connected into a graph according to spatial proximity and feature similarity, capturing the complex spatial relationships between them.
- **Combination of Graph Attention Network (GAT) and Transformer encoder**: A GAT processes the information-rich graph embeddings to model local and global dependencies in the image, and a Transformer encoder then captures long-range dependencies and complex interactions.
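The patching step described above (cutting a backbone feature map into non-overlapping k x k patches and flattening each one into a vector) can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function name `extract_patches`, the synthetic feature map standing in for an EfficientNet output, and the choice to drop rows and columns that do not fill a whole patch are all assumptions.

```python
import numpy as np


def extract_patches(feature_map: np.ndarray, k: int) -> np.ndarray:
    """Split an (H, W, C) feature map into non-overlapping k x k patches.

    Returns an array of shape (H // k, W // k, k * k * C) whose entry [i, j]
    is the flattened patch feature_map[i*k:(i+1)*k, j*k:(j+1)*k, :].
    """
    h, w, c = feature_map.shape
    rows, cols = h // k, w // k
    # Assumption: trailing rows/columns that do not fill a patch are dropped.
    trimmed = feature_map[: rows * k, : cols * k, :]
    # (rows, k, cols, k, C) -> (rows, cols, k, k, C), then flatten each patch.
    blocks = trimmed.reshape(rows, k, cols, k, c).transpose(0, 2, 1, 3, 4)
    return blocks.reshape(rows, cols, k * k * c)


# Synthetic stand-in for a backbone feature map; the shape is illustrative.
feature_map = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
patches = extract_patches(feature_map, k=4)
print(patches.shape)  # (2, 2, 48)
```

Each flattened patch then serves as the feature vector of one graph node in the later stages.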
According to the paper, these improvements allow SAG-ViT to outperform other Transformer-based methods on multiple image classification benchmarks.

### Formula summary

- Patch extraction formula:

\[ P_{i,j} = F[i\cdot k:(i+1)\cdot k,\; j\cdot k:(j+1)\cdot k,\; :] \]

\[ U_k(F) = \{P_{i,j} \mid P_{i,j} = F[i\cdot k:(i+1)\cdot k,\; j\cdot k:(j+1)\cdot k,\; :]\} \]

\[ p_{i,j} = \text{vec}(P_{i,j}) \]

- Edge weight formula in graph construction:

\[ A_{u,v} = \begin{cases} \exp\left(-\frac{\|x_u - x_v\|^2}{2\sigma^2}\right) & \text{if } v \in N_k(u) \\ 0 & \text{otherwise} \end{cases} \]

- Attention coefficient formula in GAT:

\[ \alpha_{u,v} = \frac{\exp(\text{LeakyReLU}(a^{\top}[Wx_u \| Wx_v]))}{\sum_{k \in N(u)} \exp(\text{LeakyReLU}(a^{\top}[Wx_u \| Wx_k]))} \]

- Self-attention formula in the Transformer encoder:

\[ \text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \]

Together, these formulas describe the stages of SAG-ViT: extracting patches from the feature map, weighting graph edges by feature similarity, refining node embeddings with GAT attention, and applying Transformer self-attention.
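The edge-weight, GAT attention, and self-attention formulas above can be checked numerically with a small NumPy sketch. It is illustrative only and rests on several assumptions not stated in the summary: N_k(u) is taken to be the k nearest neighbors of u in feature space (excluding u itself), N(u) in the GAT formula is taken to be the same neighbor set, the LeakyReLU slope is 0.2, and all function names, shapes, and random inputs are hypothetical.

```python
import numpy as np


def knn_gaussian_adjacency(x: np.ndarray, k: int, sigma: float) -> np.ndarray:
    """A[u, v] = exp(-||x_u - x_v||^2 / (2 sigma^2)) if v in N_k(u), else 0."""
    n = x.shape[0]
    sq_dist = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(sq_dist, np.inf)  # assumption: no self-loops
    neighbors = np.argsort(sq_dist, axis=1)[:, :k]  # k nearest per node
    rows = np.repeat(np.arange(n), k)
    cols = neighbors.reshape(-1)
    adj = np.zeros((n, n))
    adj[rows, cols] = np.exp(-sq_dist[rows, cols] / (2.0 * sigma**2))
    return adj


def gat_attention(x, adj, W, a, slope=0.2):
    """alpha[u, v]: softmax over v in N(u) of LeakyReLU(a^T [W x_u || W x_v])."""
    h = x @ W.T  # row u holds W x_u
    d = h.shape[1]
    # a^T [h_u || h_v] separates into a[:d] . h_u + a[d:] . h_v
    scores = (h @ a[:d])[:, None] + (h @ a[d:])[None, :]
    scores = np.where(scores > 0, scores, slope * scores)  # LeakyReLU
    scores = np.where(adj > 0, scores, -np.inf)  # keep only neighbors
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for 2-D inputs."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V


rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))  # 6 hypothetical patch embeddings of dimension 4
adj = knn_gaussian_adjacency(x, k=2, sigma=1.0)
W = rng.normal(size=(3, 4))  # projects embeddings from 4 to 3 dimensions
a = rng.normal(size=(6,))  # attention vector over the concatenated pair
alpha = gat_attention(x, adj, W, a)
out = scaled_dot_product_attention(x, x, x)
print(alpha.shape, out.shape)  # (6, 6) (6, 4)
```

Each row of `alpha` sums to 1 and is nonzero only where `adj` has an edge, which mirrors the normalization over N(u) in the GAT formula. A full GAT layer would additionally aggregate neighbor features with these coefficients, which is omitted here.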

SAG-ViT: A Scale-Aware, High-Fidelity Patching Approach with Graph Attention for Vision Transformers

SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation

ScopeViT: Scale-aware Vision Transformer

Vision Transformer with Sparse Scan Prior

SAViT: Structure-Aware Vision Transformer Pruning Via Collaborative Optimization

MPViT: Multi-Path Vision Transformer for Dense Prediction

Constituent Attention for Vision Transformers

ScalableViT: Rethinking the Context-Oriented Generalization of Vision Transformer

MaxViT: Multi-Axis Vision Transformer

Scaling Vision Transformers

DctViT: Discrete Cosine Transform Meet Vision Transformers

FasterViT: Fast Vision Transformers with Hierarchical Attention

Retina Vision Transformer (RetinaViT): Introducing Scaled Patches into Vision Transformers

A Free Lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition

CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

Auto-scaling Vision Transformers without Training

GvT: A Graph-based Vision Transformer with Talking-Heads Utilizing Sparsity, Trained from Scratch on Small Datasets

MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers

Data Augmentation Vision Transformer for Fine-grained Image Classification