Abstract: Contrastive Language-Image Pretraining (CLIP) has been widely used in vision tasks. Notably, CLIP has demonstrated promising performance in few-shot learning (FSL). However, existing CLIP-based methods in training-free FSL (i.e., without the requirement of additional training) mainly learn different modalities independently, leading to two essential issues: 1) severe anomalous match in image modality; 2) varying quality of generated text prompts. To address these issues, we build a mutual guidance mechanism that introduces an Image-Guided-Text (IGT) component to rectify the varying quality of text prompts through image representations, and a Text-Guided-Image (TGI) component to mitigate the anomalous match of image modality through text representations. By integrating IGT and TGI, we adopt a perspective of Text-Image Mutual guidance Optimization, proposing TIMO. Extensive experiments show that TIMO significantly outperforms the state-of-the-art (SOTA) training-free method. Additionally, by exploring the extent of mutual guidance, we propose an enhanced variant, TIMO-S, which even surpasses the best training-required methods by 0.33% with approximately 100 times less time cost. Our code is available at https://github.com/lyymuwu/TIMO.

What problem does this paper attempt to address?

The paper targets two key issues that arise when the contrastive learning framework CLIP is applied to training-free few-shot classification (FSL):

1. **Severe anomalous match in image modality**: Existing methods model the modalities independently, which leads to severe anomalous matches in the image modality. Specifically, defects introduced during CLIP's pre-training stage make the similarity computed between image features insufficiently accurate, which is especially prone to cause misclassification in the few-shot case.
2. **Varying quality of generated text prompts**: The text prompts generated by existing training-free FSL methods are of uneven quality, which leaves a large semantic gap between the text and image modalities and hurts the final classification performance.

To solve these two problems, the authors propose a two-way guidance mechanism, **Text-Image Mutual guidance Optimization (TIMO)**. TIMO consists of two components:

- **Image-Guided-Text (IGT)**: Corrects the varying quality of generated text prompts through image representations.
- **Text-Guided-Image (TGI)**: Alleviates the anomalous match problem in the image modality through text representations.

By integrating IGT and TGI, TIMO significantly improves few-shot classification performance under training-free conditions. Its enhanced version, TIMO-S, even surpasses the best training-required method by 0.33% while reducing the time cost by approximately 100 times.
### Formula summary

- **Text-Guided-Image (TGI)**:
  - Calculate the weight vector \( s_i \): \[ s_i = \frac{F_i^t W_i^v}{\|F_i^t\| \, \|W_i^v\|} \in \mathbb{R}^P \]
  - Adjust the weight vector \( s_i \) so that only \( \beta \) of its \( P \) entries are kept: \[ s_i = \text{diag}(I_\beta, 0_{P - \beta, P - \beta}) \, s_i \in \mathbb{R}^P \]
  - Construct the TGI features: \[ F_{\text{TGI}} = \text{Concat}(F_v, F_t \odot S) \in \mathbb{R}^{N \times (K + \beta) \times D} \]
- **Image-Guided-Text (IGT)**:
  - Define the optimization objective: \[ \max_{r_i} \; r_i^\top F_i^t W_i^v, \quad \text{s.t.} \quad \|r_i\| = \gamma \]
  - Use the Lagrange multiplier method to transform it into an unconstrained problem, writing the constraint \( \|r_i\| = \gamma \) as \( r_i^\top r_i = \gamma^2 \): \[ \min_{r_i} L = -r_i^\top F_i^t W_i^v + \lambda (r_i^\top r_i - \gamma^2) \]
  - Solve for the stationary points (the maximum of the original objective is attained with the positive sign): \[ \begin{cases} \lambda = \pm\frac{1}{2\gamma} \|F_i^t W_i^v\| \\ r_i = \pm\gamma \frac{F_i^t W_i^v}{\|F_i^t W_i^v\|} \end{cases} \]
  - Finally, calculate the logits: \[ R = \text{Sof
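
The TGI weighting and the IGT closed-form solution summarized above can be illustrated with a minimal NumPy sketch for a single class \( i \). This is not the authors' implementation (see their repository for that). The shapes are assumptions made here for illustration: `F_t` holds the \( P \) prompt features as rows of a \( P \times D \) matrix, `w_v` is a \( D \)-dimensional image representation for the class, and `F_v` holds the \( K \) shot features. The function names are hypothetical, and keeping the \( \beta \) largest similarities is an assumed reading of the \( \text{diag}(I_\beta, 0) \) step.

```python
import numpy as np


def prompt_similarity(F_t, w_v):
    """Cosine similarity s_i between each prompt feature (row of F_t) and w_v."""
    numerator = F_t @ w_v                                   # shape (P,)
    denominator = np.linalg.norm(F_t, axis=1) * np.linalg.norm(w_v)
    return numerator / denominator


def keep_top_beta(s, beta):
    """Zero all but the beta largest entries of s.

    Assumption: the diag(I_beta, 0) step acts on prompts ranked by similarity.
    Returns the masked vector and the indices of the kept prompts.
    """
    top = np.argsort(-s)[:beta]
    masked = np.zeros_like(s)
    masked[top] = s[top]
    return masked, top


def tgi_features(F_v, F_t, s, beta):
    """Concatenate K image features with beta similarity-weighted text features."""
    _, top = keep_top_beta(s, beta)
    weighted_text = F_t[top] * s[top, None]                 # shape (beta, D)
    return np.concatenate([F_v, weighted_text], axis=0)     # shape (K + beta, D)


def igt_weights(F_t, w_v, gamma):
    """Closed-form maximizer of r^T (F_t w_v) subject to ||r|| = gamma."""
    a = F_t @ w_v
    return gamma * a / np.linalg.norm(a)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P, K, D, beta, gamma = 5, 3, 8, 2, 0.5                  # illustrative sizes only
    F_t = rng.normal(size=(P, D))
    F_v = rng.normal(size=(K, D))
    w_v = F_v.mean(axis=0)                                  # assumed class prototype

    s = prompt_similarity(F_t, w_v)
    print("TGI feature shape:", tgi_features(F_v, F_t, s, beta).shape)
    print("IGT weight norm:", np.linalg.norm(igt_weights(F_t, w_v, gamma)))
```

By the Cauchy-Schwarz inequality, no other vector of norm \( \gamma \) gives a larger value of \( r_i^\top F_i^t W_i^v \) than the one returned by `igt_weights`, which is why the positive-sign solution is the maximizer.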

Text and Image Are Mutually Beneficial: Enhancing Training-Free Few-Shot Classification with CLIP
