Abstract: Large pre-trained vision-language models, such as CLIP [Radford et al. 2021], have demonstrated remarkable performance in few-shot image classification. Two primary frameworks have been proposed to adapt CLIP rapidly to downstream tasks with limited visual samples. The first centers on the image encoder and adds a trainable visual classifier after the backbone to generate logits for each object class. However, this framework depends heavily on the limited visual features extracted by the pre-trained image encoder, which can lead to over-fitting. The second optimizes the text side by learning trainable soft language prompts and computes the logits for each class from the similarity between image features and the optimized prompt features. However, the representations extracted by the image and text encoders are imperfectly aligned, which makes it difficult to fine-tune the language prompts using visual samples. This paper proposes Multi-Modal Prototype Regularization (MMPR), a method for CLIP-based few-shot fine-tuning for image classification that addresses the challenge of using image and text features together. MMPR fine-tunes a classifier and regularizes its weights using both image-based (ImgPR) and text-based (TexPR) prototypes. ImgPR is the mean of the image representations within a class, derived from the image encoder, and distills class-specific knowledge of the visual distribution for classifier adaptation. TexPR is the representation of the hand-crafted prompt associated with the class, derived from the text encoder, and incorporates general encyclopedic knowledge to mitigate visual over-fitting. MMPR exploits both image and text information without increasing computational complexity at inference time compared to existing methods. Experimental results on various challenging public benchmarks demonstrate that the proposed MMPR method outperforms state-of-the-art methods.
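The prototype regularization described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the regularizer, so the squared Euclidean penalty, the weights `lam_img` and `lam_txt`, and all function names below are assumptions made for illustration only. In practice the features and text prototypes would come from the CLIP image and text encoders; plain Python lists stand in for them here.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def image_prototypes(features, labels, num_classes):
    """ImgPR: the normalized mean of the image features in each class."""
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for feat, label in zip(features, labels):
        counts[label] += 1
        for i, x in enumerate(feat):
            sums[label][i] += x
    return [
        l2_normalize([s / max(c, 1) for s in row])
        for row, c in zip(sums, counts)
    ]


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def prototype_regularizer(weights, img_protos, txt_protos,
                          lam_img=1.0, lam_txt=1.0):
    """Assumed penalty: pull each class weight toward its ImgPR and TexPR.

    The squared-distance form and the lam_* weights are hypothetical;
    the abstract only states that both prototypes regularize the weights.
    """
    total = 0.0
    for w, p_img, p_txt in zip(weights, img_protos, txt_protos):
        total += lam_img * squared_distance(w, p_img)
        total += lam_txt * squared_distance(w, p_txt)
    return total / len(weights)


def class_logits(feature, weights):
    """Inference uses only the classifier: one dot product per class."""
    f = l2_normalize(feature)
    return [sum(a * b for a, b in zip(f, w)) for w in weights]
```

During fine-tuning, a term of this kind would be added to the usual classification loss. At inference only `class_logits` is evaluated, which is consistent with the abstract's statement that the prototypes add no inference-time cost.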

Fine-Tuning for Few-shot Image Classification by Multimodal Prototype Regularization
