Abstract: Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples. This ability stems from their capacity to identify common features shared between new and previously seen images while disregarding distractions such as background variations. For artificial neural network models, however, determining the most relevant features for distinguishing between two images from limited samples remains a challenge. In this paper, we propose an intra-task mutual attention method for few-shot learning that splits the support and query samples into patches and encodes them using the pre-trained Vision Transformer (ViT) architecture. Specifically, we swap the class (CLS) token and patch tokens between the support and query sets to obtain mutual attention, which enables each set to focus on the most useful information. This strengthens intra-class representations and brings instances of the same class closer together. For implementation, we adopt the ViT-based network architecture and use pre-trained model parameters obtained through self-supervision. By leveraging Masked Image Modeling as the self-supervised pre-training task, the pre-trained model yields semantically meaningful representations while avoiding supervision collapse. We then employ a meta-learning method to fine-tune the last several layers and the CLS token modules. Our strategy significantly reduces the number of parameters that require fine-tuning while effectively utilizing the capability of the pre-trained model. Extensive experiments show that our framework is simple, effective, and computationally efficient, outperforming state-of-the-art baselines on five popular few-shot classification benchmarks under the 5-shot and 1-shot scenarios.
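The token swap described in the abstract can be illustrated with a minimal, dependency-free sketch. This is one plausible reading of the abstract, not the authors' implementation: it assumes each image is already encoded as a token list whose first entry is the CLS token, uses a single attention head with no learned projections, and compares the two resulting CLS embeddings by cosine similarity. The function names (`cls_attend`, `mutual_attention`, `random_tokens`) and the token shapes are hypothetical, chosen only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cls_attend(cls_tok, patches):
    """Let one CLS token (length d) attend over a list of patch tokens.

    Assumption: plain scaled dot-product attention, with the CLS token as
    the query and the patches as both keys and values.
    """
    d = len(cls_tok)
    scores = [
        sum(c * p for c, p in zip(cls_tok, patch)) / math.sqrt(d)
        for patch in patches
    ]
    weights = softmax(scores)
    # Weighted average of the patch tokens, one coordinate at a time.
    return [sum(w * patch[j] for w, patch in zip(weights, patches)) for j in range(d)]


def mutual_attention(support_tokens, query_tokens):
    """Swap CLS tokens: each CLS token reads the *other* set's patches."""
    s_cls, s_patches = support_tokens[0], support_tokens[1:]
    q_cls, q_patches = query_tokens[0], query_tokens[1:]
    s_out = cls_attend(s_cls, q_patches)  # support CLS over query patches
    q_out = cls_attend(q_cls, s_patches)  # query CLS over support patches
    return s_out, q_out


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)


def random_tokens(n, d):
    """Hypothetical stand-in for ViT outputs: n tokens of dimension d."""
    return [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


random.seed(0)
support = random_tokens(1 + 16, 8)  # 1 CLS token + 16 patch tokens
query = random_tokens(1 + 16, 8)
s_out, q_out = mutual_attention(support, query)
print(len(s_out), len(q_out), round(cosine(s_out, q_out), 3))
```

In this sketch the similarity between `s_out` and `q_out` would serve as the matching score for a support-query pair. A real model would apply the swap inside the ViT's attention layers with learned projections and multiple heads, which the sketch deliberately omits.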

Matching Multi-Scale Feature Sets in Vision Transformer for Few-Shot Classification

SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation

CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

Semantic Prompt Based Multi-Scale Transformer for Few-Shot Classification

Multi-level adaptive few-shot learning network combined with vision transformer

Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning

MFF-Trans: Multi-level Feature Fusion Transformer for Fine-Grained Visual Classification

Ts-vit: feature-enhanced transformer via token selection for fine-grained image recognition

Object Detection Via Multi-Scale Token Based on Vision Transformer

MSViT: Training Multiscale Vision Transformers for Image Retrieval

Feature Fusion Vision Transformer for Fine-Grained Visual Categorization

ScopeViT: Scale-aware Vision Transformer

CSiT: A Multiscale Vision Transformer for Hyperspectral Image Classification

Few-Shot Learning Meets Transformer: Unified Query-Support Transformers for Few-Shot Classification

Cross-Domain Hyperspectral Image Classification Based on Transformer

A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation

Cross-scale Vision Transformer for crowd localization

A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition

Siamese Transformer Networks for Few-shot Image Classification

SimViT: Exploring a Simple Vision Transformer with sliding windows

TATM: Task-Adaptive Token Matching for Few-Shot Transformer