Text and Image Are Mutually Beneficial: Enhancing Training-Free Few-Shot Classification with CLIP

Yayuan Li,Jintao Guo,Lei Qi,Wenbin Li,Yinghuan Shi
2024-12-16
Abstract: Contrastive Language-Image Pretraining (CLIP) has been widely used in vision tasks. Notably, CLIP has demonstrated promising performance in few-shot learning (FSL). However, existing CLIP-based methods in training-free FSL (i.e., without the requirement of additional training) mainly learn different modalities independently, leading to two essential issues: 1) severe anomalous match in image modality; 2) varying quality of generated text prompts. To address these issues, we build a mutual guidance mechanism, that introduces an Image-Guided-Text (IGT) component to rectify varying quality of text prompts through image representations, and a Text-Guided-Image (TGI) component to mitigate the anomalous match of image modality through text representations. By integrating IGT and TGI, we adopt a perspective of Text-Image Mutual guidance Optimization, proposing TIMO. Extensive experiments show that TIMO significantly outperforms the state-of-the-art (SOTA) training-free method. Additionally, by exploring the extent of mutual guidance, we propose an enhanced variant, TIMO-S, which even surpasses the best training-required methods by 0.33% with approximately 100 times less time cost. Our code is available at https://github.com/lyymuwu/TIMO.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper targets two key issues that arise when CLIP, a contrastive language-image model, is applied to training-free few-shot classification (FSL):

1. **Severe anomalous match in image modality**: Existing methods model the modalities independently, which leads to serious anomalous matches in the image modality. Specifically, defects introduced during CLIP pre-training make the similarity computed between image features inaccurate, and in the few-shot setting this readily causes misclassification.
2. **Varying quality of generated text prompts**: The text prompts generated by existing training-free FSL methods are of uneven quality, which leaves a large semantic gap between the text and image modalities and degrades the final classification performance.

To solve these two problems, the authors propose a mutual guidance mechanism, **Text-Image Mutual guidance Optimization (TIMO)**. TIMO consists of two components:

- **Image-Guided-Text (IGT)**: Corrects the varying quality of generated text prompts through image representations.
- **Text-Guided-Image (TGI)**: Alleviates the anomalous match problem in the image modality through text representations.

By integrating IGT and TGI, TIMO significantly improves few-shot classification performance under training-free conditions, and its enhanced variant TIMO-S even surpasses the best training-required method while reducing the time cost by approximately 100 times.
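As a rough illustration of how one modality could guide the other, the pure-Python sketch below scores a class's text-prompt features against an image-derived class vector, keeps only the best-matching prompts (TGI-style filtering), and computes a fixed-norm weighting over the prompts (IGT-style weighting). All function names, the toy data, and the choice to pick the highest-scoring prompts are illustrative assumptions, not the authors' implementation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cosine_scores(prompt_feats, image_vec):
    """Cosine similarity of each prompt feature with the image-derived class vector."""
    return [dot(f, image_vec) / (norm(f) * norm(image_vec)) for f in prompt_feats]

def keep_top_beta(scores, beta):
    """TGI-style filtering (assumed form): keep the beta highest scores, zero the rest."""
    top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:beta]
    return [s if j in top else 0.0 for j, s in enumerate(scores)]

def igt_weights(prompt_feats, image_vec, gamma):
    """IGT-style weighting (assumed form): the maximizer of r . a subject to
    ||r|| = gamma, where a_j = f_j . image_vec, is r = gamma * a / ||a||."""
    a = [dot(f, image_vec) for f in prompt_feats]
    scale = gamma / norm(a)
    return [scale * x for x in a]

# Toy data (hypothetical): three prompt features for one class, one image vector.
prompts = [[1.0, 0.0], [0.6, 0.8], [-1.0, 0.0]]
image_vec = [1.0, 0.0]

scores = cosine_scores(prompts, image_vec)
filtered = keep_top_beta(scores, beta=2)
weights = igt_weights(prompts, image_vec, gamma=2.0)
```

In this toy case the third prompt points away from the image vector, so filtering zeroes its score, while the weighting vector always has norm `gamma` and is aligned with the prompt-to-image similarities.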
### Formula summary

- **Text-Guided-Image (TGI)**:
  - Calculate the weight vector \( s_i \):
    \[ s_i=\frac{F_i^t W_i^v}{\|F_i^t\| \|W_i^v\|} \in \mathbb{R}^P \]
  - Adjust the weight vector \( s_i \) so that only its first \( \beta \) entries are kept:
    \[ s_i = \text{diag}(I_\beta, 0_{P - \beta, P - \beta}) s_i \in \mathbb{R}^P \]
  - Construct TGI features:
    \[ F_{\text{TGI}}=\text{Concat}(F_v, F_t \odot S) \in \mathbb{R}^{N\times(K + \beta)\times D} \]
- **Image-Guided-Text (IGT)**:
  - Define the optimization objective:
    \[ \max_{r_i} r_i^\top F_i^t W_i^v, \quad \text{s.t.} \quad \|r_i\|=\gamma \]
  - Use the Lagrange multiplier method to transform it into an unconstrained problem, writing the constraint as \( r_i^\top r_i = \gamma^2 \):
    \[ \min_{r_i} L=-r_i^\top F_i^t W_i^v+\lambda(r_i^\top r_i - \gamma^2) \]
  - Solve for the optimal solution:
    \[ \begin{cases} \lambda = \pm\frac{1}{2\gamma}\|F_i^t W_i^v\| \\ r_i=\pm\gamma\frac{F_i^t W_i^v}{\|F_i^t W_i^v\|} \end{cases} \]
  - Finally calculate logits: \[ R = \text{Sof