Abstract: Few-shot learning aims to train models that can generalize to novel classes from only a few samples. Recently, a line of work has been proposed to enhance few-shot learning with accessible semantic information from class names. However, these works focus on improving existing modules of the standard few-shot learning framework, such as visual prototypes and feature extractors. This limits how fully the semantic information can be used. In this paper, we propose a novel few-shot learning framework that uses pre-trained language models based on contrastive learning. To address the challenge of aligning visual features with textual embeddings obtained from a text-based pre-trained language model, we carefully design the textual branch of our framework and introduce a metric module to generalize the cosine similarity. For better transferability, we let the metric module adapt to different few-shot tasks and adopt MAML to train the model via bi-level optimization. Moreover, we conduct extensive experiments on multiple benchmarks to demonstrate the effectiveness of our method.

What problem does this paper attempt to address?

The paper aims to address the following issues:

1. **Few-Shot Image Classification**: The study explores how to leverage pre-trained language models to improve image classification in few-shot learning. Specifically, the paper proposes a new framework that aligns visual features with text embeddings through contrastive learning.

2. **Challenges in Aligning Visual Features and Text Embeddings**: Existing methods mainly focus on improving individual modules (such as visual prototypes and feature extractors) within the standard few-shot learning framework, without fully using the capabilities of pre-trained language models. The paper therefore designs a text branch and introduces a metric module that generalizes the cosine similarity, addressing the alignment issue between visual features and text embeddings.

3. **Cross-Domain Adaptability and Performance in Multi-Shot Scenarios**: In addition to validating the model under standard few-shot settings, the paper evaluates it in cross-domain scenarios (e.g., from miniImageNet to the CUB dataset) and in scenarios with more samples (10-shot, 30-shot, and 50-shot) to check its stability under different conditions.

In summary, the main objective of the paper is to explore how to better use pre-trained language models in few-shot image classification, and to propose a method that addresses the alignment challenge between visual features and text embeddings, with the goal of improving performance across multiple benchmarks.
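The ideas summarized above (a metric that generalizes cosine similarity, a contrastive objective over class-name embeddings, and task-level adaptation of the metric) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the bilinear form with a matrix `M`, the function names, the temperature, and the learning rate are all hypothetical, and random vectors stand in for real image features and language-model embeddings. Only the inner adaptation step of a MAML-style procedure is shown; the outer loop that would meta-learn the initial `M` is omitted.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def metric_logits(visual, text, M, temperature=0.1):
    """Assumed generalized similarity: v_hat^T M t_hat / temperature.
    With M equal to the identity this reduces to plain cosine similarity."""
    return (l2_normalize(visual) @ M @ l2_normalize(text).T) / temperature

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def contrastive_loss(logits, labels):
    """Cross-entropy that pulls each image toward its own class text."""
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def loss_grad_wrt_M(visual, text, labels, M, temperature=0.1):
    """Analytic gradient of the contrastive loss with respect to M."""
    v, t = l2_normalize(visual), l2_normalize(text)
    p = softmax((v @ M @ t.T) / temperature)
    onehot = np.eye(text.shape[0])[labels]
    return v.T @ (p - onehot) @ t / (len(labels) * temperature)

def adapt_metric(visual, text, labels, M, lr=0.005, steps=1, temperature=0.1):
    """Inner-loop adaptation only: a few gradient steps on the support set.
    The outer (meta) optimization of the initial M is not shown."""
    for _ in range(steps):
        M = M - lr * loss_grad_wrt_M(visual, text, labels, M, temperature)
    return M

# Toy 5-way 1-shot task with random stand-in features (hypothetical data).
rng = np.random.default_rng(0)
support_feats = rng.normal(size=(5, 16))   # stand-in for image features
class_text = rng.normal(size=(5, 16))      # stand-in for class-name embeddings
support_labels = np.arange(5)

M_init = np.eye(16)
M_task = adapt_metric(support_feats, class_text, support_labels, M_init)
print("support loss before:", contrastive_loss(
    metric_logits(support_feats, class_text, M_init), support_labels))
print("support loss after: ", contrastive_loss(
    metric_logits(support_feats, class_text, M_task), support_labels))
```

Starting from the identity matrix means the unadapted model behaves exactly like cosine-similarity matching, so any task-specific adaptation is a departure from that baseline rather than a replacement for it.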

FILM: How can Few-Shot Image Classification Benefit from Pre-Trained Language Models?

Improving Few-shot Text Classification via Pretrained Language Representations

Feature Transformation for Few-Shot Learning

Less is More: A Closer Look at Semantic-based Few-Shot Learning

When Low Resource NLP Meets Unsupervised Language Model: Meta-Pretraining then Meta-Learning for Few-Shot Text Classification (Student Abstract)

Bimodal semantic fusion prototypical network for few-shot classification

Semantic-Based Few-Shot Learning by Interactive Psychometric Testing

Efficient Few-Shot Classification Via Contrastive Pretraining on Web Data

Collaboration of Pre-trained Models Makes Better Few-shot Learner

Simple Semantic-Aided Few-Shot Learning

Learning to Compare Relation: Semantic Alignment for Few-Shot Learning

SgVA-CLIP: Semantic-Guided Visual Adapting of Vision-Language Models for Few-Shot Image Classification

Aligning Visual Prototypes with BERT Embeddings for Few-Shot Learning

Multimodal few-shot classification without attribute embedding

Mining Open Semantics from CLIP: A Relation Transition Perspective for Few-Shot Learning

VL-Few: Vision Language Alignment for Multimodal Few-Shot Meta Learning

Few-Shot Classification with Contrastive Learning

Collect and Select: Semantic Alignment Metric Learning for Few-Shot Learning

LPN: Language-guided Prototypical Network for few-shot classification

Shaping Visual Representations with Language for Few-shot Classification

Language-guided Few-shot Semantic Segmentation