Abstract: CLIP showcases exceptional cross-modal matching capabilities due to its training on image-text contrastive learning tasks. However, without specific optimization for unimodal scenarios, its performance in single-modality feature extraction might be suboptimal. Despite this, some studies have directly used CLIP's image encoder for tasks like few-shot classification, introducing a misalignment between its pre-training objectives and feature extraction methods. This inconsistency can diminish the quality of the image's feature representation, adversely affecting CLIP's effectiveness in target tasks. In this paper, we view text features as precise neighbors of image features in CLIP's space and present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts. This feature extraction method aligns better with CLIP's pre-training objectives, thereby fully leveraging CLIP's robust cross-modal capabilities. The key to constructing a high-quality CODER lies in creating a large number of high-quality and diverse texts to match with images. We introduce the Auto Text Generator (ATG) to automatically generate the required texts in a data-free and training-free manner. We apply CODER to CLIP's zero-shot and few-shot image classification tasks. Experimental results across various datasets and models confirm CODER's effectiveness. Code is available at:

What problem does this paper attempt to address?

The main problem this paper addresses is how to use CLIP's powerful cross-modal matching ability to improve image feature extraction in unimodal tasks, thereby enhancing CLIP's performance in downstream tasks, especially zero-shot and few-shot image classification. Specifically, although CLIP performs well in cross-modal matching, its performance in unimodal tasks such as image classification may be inferior to that of specially optimized models. Some studies directly use CLIP's image encoder for few-shot classification, which creates an inconsistency between the pre-training objective and the feature extraction method and thus degrades the quality of the image feature representation. To solve this problem, the authors propose a new cross-modal neighbor representation (CrOss-moDal nEighbor Representation, CODER) based on the distance structure between an image and its neighboring texts, and introduce an Auto Text Generator (ATG) that generates high-quality texts for constructing a more effective CODER.

### Main contributions:

1. **Propose CODER**: Using CLIP's image-text distance relationship, construct a new image representation so that image features align better with CLIP's pre-training objective.
2. **Automatic text generation**: Design a data-free and training-free Auto Text Generator (ATG) that generates diverse, high-quality texts from the class names of the target dataset, increasing the density of neighbor texts around each image in CLIP's feature space.
3. **Apply to classification tasks**: Apply CODER to CLIP's zero-shot and few-shot image classification tasks. The experimental results show that CODER significantly improves CLIP's performance in both types of tasks.

### Experimental results:

- **Zero-shot image classification**: CODER significantly improves CLIP's zero-shot classification accuracy on multiple datasets. Using a one-to-one specific CODER in the re-ranking stage improves performance further.
- **Few-shot image classification**: CODER-Adapter outperforms existing training-free CLIP few-shot image classification methods on most datasets and across different numbers of samples.

In conclusion, by proposing CODER and ATG, this paper addresses the problem of insufficient feature extraction by CLIP in unimodal tasks and significantly improves CLIP's performance in zero-shot and few-shot image classification.
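
The core idea summarized above, representing an image by its similarities to a set of neighbor texts instead of by its raw image embedding, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`coder_features`, `zero_shot_predict`), the use of cosine similarity as the distance measure, and the per-class averaging rule are all assumptions made for illustration, and the toy 4-dimensional vectors stand in for real CLIP image and text embeddings.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def coder_features(image_emb, text_emb):
    """Hypothetical CODER-style representation.

    image_emb: (N, D) image embeddings; text_emb: (K, D) text embeddings.
    Returns an (N, K) matrix whose row i holds the cosine similarity
    between image i and each of the K neighbor texts.
    """
    return l2_normalize(image_emb) @ l2_normalize(text_emb).T


def zero_shot_predict(coder, text_class_ids, num_classes):
    """Assumed decision rule: average similarities over each class's texts, take argmax."""
    scores = np.zeros((coder.shape[0], num_classes))
    for c in range(num_classes):
        scores[:, c] = coder[:, text_class_ids == c].mean(axis=1)
    return scores.argmax(axis=1)


# Toy stand-ins for CLIP embeddings: two texts per class, two classes.
text_emb = np.array([
    [1.0, 0.0, 0.0, 0.0],   # class 0, text A
    [0.9, 0.1, 0.0, 0.0],   # class 0, text B
    [0.0, 1.0, 0.0, 0.0],   # class 1, text A
    [0.0, 0.9, 0.1, 0.0],   # class 1, text B
])
text_class_ids = np.array([0, 0, 1, 1])

image_emb = np.array([
    [2.0, 0.1, 0.0, 0.0],   # points toward the class 0 texts
    [0.1, 3.0, 0.0, 0.0],   # points toward the class 1 texts
])

coder = coder_features(image_emb, text_emb)            # shape (2, 4)
preds = zero_shot_predict(coder, text_class_ids, 2)
print(coder.round(3))
print(preds)
```

In this sketch each image is described by a K-dimensional vector with one entry per text, so generating more texts per class (the role the summary assigns to ATG) directly lengthens and enriches the representation. A few-shot variant could compare such vectors between query and support images, but how the paper's CODER-Adapter does this is not specified in the text above.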

Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification

CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval

Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment

Long-CLIP: Unlocking the Long-Text Capability of CLIP

ProtoCLIP: Prototypical Contrastive Language Image Pretraining

EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling

CLIP-PING: Boosting Lightweight Vision-Language Models with Proximus Intrinsic Neighbors Guidance

Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP

DiffCLIP: Few-shot Language-driven Multimodal Classifier

FLEX-CLIP: Feature-Level GEneration Network Enhanced CLIP for X-shot Cross-modal Retrieval

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness

Improving CLIP Training with Language Rewrites

Adaptive CLIP for open-domain 3D model retrieval

Finetuning CLIP to Reason about Pairwise Differences

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Alignment

TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation

Non-Contrastive Learning Meets Language-Image Pre-Training

Unlocking the Multi-modal Potential of CLIP for Generalized Category Discovery