Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification

Chao Yi, Lu Ren, De-Chuan Zhan, Han-Jia Ye
2024-04-27
Abstract: CLIP showcases exceptional cross-modal matching capabilities due to its training on image-text contrastive learning tasks. However, without specific optimization for unimodal scenarios, its performance in single-modality feature extraction might be suboptimal. Despite this, some studies have directly used CLIP's image encoder for tasks like few-shot classification, introducing a misalignment between its pre-training objectives and feature extraction methods. This inconsistency can diminish the quality of the image's feature representation, adversely affecting CLIP's effectiveness in target tasks. In this paper, we view text features as precise neighbors of image features in CLIP's space and present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts. This feature extraction method aligns better with CLIP's pre-training objectives, thereby fully leveraging CLIP's robust cross-modal capabilities. The key to constructing a high-quality CODER lies in creating a large number of high-quality, diverse texts to match with images. We introduce the Auto Text Generator (ATG) to automatically generate the required texts in a data-free and training-free manner. We apply CODER to CLIP's zero-shot and few-shot image classification tasks. Experimental results across various datasets and models confirm CODER's effectiveness. Code is available at:
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The main problem this paper addresses is how to use CLIP's strong cross-modal matching ability to improve image feature extraction in unimodal tasks, and thereby improve CLIP's performance on downstream tasks, especially zero-shot and few-shot image classification. Although CLIP performs well in cross-modal matching, its performance in unimodal tasks such as image classification may fall short of models optimized specifically for them. Some studies use CLIP's image encoder directly for few-shot classification, which creates an inconsistency between the pre-training objective and the feature extraction method and degrades the quality of the image feature representation. To resolve this, the authors propose a CrOss-moDal nEighbor Representation (CODER) based on the distance structure between an image and its neighboring texts, and introduce an Auto Text Generator (ATG) that produces the high-quality texts needed to construct a more effective CODER.

### Main contributions:

1. **Propose CODER**: Use CLIP's image-text distance relationships to construct a new image representation, so that image features align better with CLIP's pre-training objective.
2. **Automatic text generation**: Design a data-free and training-free Auto Text Generator (ATG) that generates diverse, high-quality texts from the class names of the target dataset, increasing the density of neighbor texts around each image in CLIP's feature space.
3. **Apply to classification tasks**: Apply CODER to CLIP's zero-shot and few-shot image classification tasks. The experimental results show that CODER significantly improves CLIP's performance on both.

### Experimental results:

- **Zero-shot image classification**: CODER significantly improves CLIP's zero-shot classification accuracy on multiple datasets. In particular, using the one-to-one specific CODER in the re-ranking stage improves performance further.
- **Few-shot image classification**: CODER-Adapter outperforms existing training-free CLIP few-shot image classification methods on most datasets and across different numbers of shots.

In conclusion, by proposing CODER and ATG, this paper addresses the shortcomings of CLIP's feature extraction in unimodal tasks and significantly improves CLIP's performance in zero-shot and few-shot image classification.
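The core idea of a cross-modal neighbor representation can be illustrated with a minimal sketch: each image is described by its similarities to a set of neighbor texts rather than by its raw image embedding. The code below is an illustration under stated assumptions, not the authors' implementation. Random vectors stand in for CLIP image and text embeddings, the function names (`coder_features`, `zero_shot_predict`) are hypothetical, and scoring a class by the mean similarity to its texts is a simplification chosen for clarity that may differ from the paper's scoring rule.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def coder_features(image_feats, text_feats):
    """Represent each image by its cosine similarity to every neighbor text.

    image_feats: (n_images, d) array, text_feats: (n_texts, d) array.
    Returns an (n_images, n_texts) array; row i is the representation of image i.
    """
    return l2_normalize(image_feats) @ l2_normalize(text_feats).T


def zero_shot_predict(coder, text_class_ids, n_classes):
    """Assumed scoring rule: mean similarity to the texts of each class, then argmax."""
    scores = np.stack(
        [coder[:, text_class_ids == c].mean(axis=1) for c in range(n_classes)],
        axis=1,
    )
    return scores.argmax(axis=1)


# Toy stand-ins for CLIP embeddings: 3 classes in an 8-dimensional space.
rng = np.random.default_rng(0)
n_classes, dim, texts_per_class, images_per_class = 3, 8, 4, 2
prototypes = np.eye(n_classes, dim)  # one orthogonal direction per class

# Several texts per class (the role ATG-generated texts would play).
text_class_ids = np.repeat(np.arange(n_classes), texts_per_class)
text_feats = prototypes[text_class_ids] + 0.05 * rng.standard_normal(
    (n_classes * texts_per_class, dim)
)

# A few images per class, scattered around the same prototypes.
image_labels = np.repeat(np.arange(n_classes), images_per_class)
image_feats = prototypes[image_labels] + 0.05 * rng.standard_normal(
    (n_classes * images_per_class, dim)
)

coder = coder_features(image_feats, text_feats)
preds = zero_shot_predict(coder, text_class_ids, n_classes)
print(coder.shape, preds)
```

With real models, `image_feats` and `text_feats` would come from CLIP's image and text encoders. The rows of `coder` could also serve as features for a few-shot classifier, which is the role the summary attributes to CODER-Adapter.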