SAG-ViT: A Scale-Aware, High-Fidelity Patching Approach with Graph Attention for Vision Transformers

Shravan Venkatraman, Jaskaran Singh Walia, Joe Dhanith P R
2024-11-14
Abstract: Image classification is a computer vision task where a model analyzes an image to categorize it into a specific label. Vision Transformers (ViT) improve this task by leveraging self-attention to capture complex patterns and long-range relationships between image patches. However, a key challenge for ViTs is efficiently incorporating multi-scale feature representations, which are inherent in CNNs through their hierarchical structure. In this paper, we introduce the Scale-Aware Graph Attention Vision Transformer (SAG-ViT), a novel framework that addresses this challenge by integrating multi-scale features. Using EfficientNet as a backbone, the model extracts multi-scale feature maps, which are divided into patches to preserve semantic information. These patches are organized into a graph based on spatial and feature similarities, with a Graph Attention Network (GAT) refining the node embeddings. Finally, a Transformer encoder captures long-range dependencies and complex interactions. The SAG-ViT is evaluated on benchmark datasets, demonstrating its effectiveness in enhancing image classification performance.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses the difficulty of effectively integrating multi-scale feature representations into the Vision Transformer (ViT) for image classification. Specifically:

1. **Limitations of ViT**:
   - ViT captures global dependencies in images through self-attention, but its fixed-size patch tokenization can miss fine-grained local details.
   - ViT usually requires large-scale datasets to train effectively, which limits its use on smaller datasets.
2. **Importance of multi-scale features**:
   - Multi-scale feature representations are crucial to ViT performance across visual tasks. They capture objects and patterns at different scales, giving a more complete picture of the image content.
   - Convolutional Neural Networks (CNNs) capture multi-scale features naturally through their hierarchical structure, but integrating this ability efficiently into Transformer-based models remains a challenge.

To address these problems, the paper proposes a new framework, the Scale-Aware Graph Attention Vision Transformer (SAG-ViT). Its main innovations are:

- **High-fidelity feature-map patching**: Patches are taken from the multi-scale feature maps extracted by a pre-trained EfficientNet backbone rather than from the raw image, which retains rich semantic information.
- **Graph construction based on k-connectivity and similarity**: Patches become graph nodes, with edges built from spatial proximity and feature similarity to capture the complex spatial relationships between patches.
- **Graph Attention Network (GAT) combined with a Transformer encoder**: The GAT processes the information-rich graph embeddings to model local and global dependencies in the image, after which the Transformer encoder captures long-range dependencies and complex interactions.

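The patching step described above can be illustrated with a minimal NumPy sketch. It assumes a single feature map `F` of shape `(H, W, C)` and a patch size `k`. The function name `extract_patches`, the decision to drop trailing rows and columns that do not fill a whole patch, and the row-major flattening order are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np


def extract_patches(feature_map: np.ndarray, k: int) -> np.ndarray:
    """Split an (H, W, C) feature map into non-overlapping k x k patches.

    Returns an array of shape (num_patches, k * k * C) with one flattened
    patch per row, ordered row-major over the patch grid.
    """
    H, W, C = feature_map.shape
    rows, cols = H // k, W // k
    # Assumption: rows/columns that do not fill a whole patch are dropped.
    cropped = feature_map[: rows * k, : cols * k, :]
    # (rows, k, cols, k, C) -> (rows, cols, k, k, C): group pixels by patch.
    patches = cropped.reshape(rows, k, cols, k, C).transpose(0, 2, 1, 3, 4)
    # Flatten each k x k x C patch into a single vector.
    return patches.reshape(rows * cols, k * k * C)


if __name__ == "__main__":
    F = np.arange(4 * 4 * 2, dtype=float).reshape(4, 4, 2)
    print(extract_patches(F, k=2).shape)
```

With a 4×4 feature map with 2 channels and `k = 2`, this yields four patch vectors of length 8, one per cell of the 2×2 patch grid. In the framework described here, these vectors would serve as the node features of the patch graph.
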
According to the paper, these changes let SAG-ViT achieve higher image classification performance than other Transformer-based methods on multiple benchmark datasets.

### Formula summary

- Patch extraction, where $F$ is a feature map from the backbone and $k$ is the patch size:
  \[ P_{i,j} = F[i\cdot k:(i+1)\cdot k,\; j\cdot k:(j+1)\cdot k,\; :] \]
  \[ U_k(F) = \{ P_{i,j} \mid P_{i,j} = F[i\cdot k:(i+1)\cdot k,\; j\cdot k:(j+1)\cdot k,\; :] \} \]
  \[ p_{i,j} = \text{vec}(P_{i,j}) \]
- Edge weights in graph construction, where $x_u$ is the feature vector of node $u$ and $N_k(u)$ is its neighborhood:
  \[ A_{u,v} = \begin{cases} \exp\left(-\frac{\|x_u - x_v\|^2}{2\sigma^2}\right) & \text{if } v\in N_k(u) \\ 0 & \text{otherwise} \end{cases} \]
- Attention coefficients in the GAT:
  \[ \alpha_{u,v} = \frac{\exp(\text{LeakyReLU}(a^{\top}[Wx_u\|Wx_v]))}{\sum_{k\in N(u)}\exp(\text{LeakyReLU}(a^{\top}[Wx_u\|Wx_k]))} \]
- Self-attention in the Transformer encoder:
  \[ \text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \]

Together, these formulas describe how SAG-ViT extracts patches from multi-scale feature maps, connects them in a weighted graph, and applies graph attention followed by self-attention for image classification.
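
The edge-weight, GAT, and self-attention formulas above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: $N_k(u)$ is taken to be the 8-connected spatial neighbors of a patch in the patch grid (the summary does not spell out the neighborhood), attention is single-head, the LeakyReLU slope is set to 0.2, and all function names are hypothetical.

```python
import numpy as np


def grid_neighbors(rows, cols):
    """Yield (u, v) node-index pairs for 8-connected cells of a rows x cols grid."""
    for i in range(rows):
        for j in range(cols):
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ni, nj = i + di, j + dj
                    if (di, dj) != (0, 0) and 0 <= ni < rows and 0 <= nj < cols:
                        yield i * cols + j, ni * cols + nj


def gaussian_adjacency(X, rows, cols, sigma=1.0):
    """A[u, v] = exp(-||x_u - x_v||^2 / (2 sigma^2)) for grid neighbors, else 0."""
    A = np.zeros((rows * cols, rows * cols))
    for u, v in grid_neighbors(rows, cols):
        A[u, v] = np.exp(-np.sum((X[u] - X[v]) ** 2) / (2.0 * sigma**2))
    return A


def gat_attention(X, mask, W, a, negative_slope=0.2):
    """Single-head GAT coefficients alpha[u, v], softmax-normalized over N(u).

    mask[u, v] is True where v is a neighbor of u; every node is assumed
    to have at least one neighbor.
    """
    H = X @ W  # row u holds W x_u
    d = H.shape[1]
    # a^T [W x_u || W x_v] splits into a source term plus a target term.
    e = (H @ a[:d])[:, None] + (H @ a[d:])[None, :]
    e = np.where(e > 0, e, negative_slope * e)  # LeakyReLU
    e = np.where(mask, e, -np.inf)              # non-neighbors get zero weight
    e = e - e.max(axis=1, keepdims=True)        # numerical stability
    w = np.exp(e)
    return w / w.sum(axis=1, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for 2-D inputs (single head, no batch)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 4))          # 6 patch vectors on a 2 x 3 grid
    A = gaussian_adjacency(X, rows=2, cols=3, sigma=2.0)
    alpha = gat_attention(X, A > 0, rng.normal(size=(4, 3)), rng.normal(size=6))
    print(alpha.sum(axis=1))             # each row of coefficients sums to 1
```

Deriving the neighbor mask from `A > 0` is a convenience for this sketch. With a very small `sigma`, a Gaussian weight can underflow to zero and silently drop an edge, so a separate boolean neighbor mask would be more robust in practice.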