Abstract: In the era of foundation models, CLIP has emerged as a powerful tool for aligning text and visual modalities into a common embedding space. However, the alignment objective used to train CLIP often results in subpar visual features for fine-grained tasks. In contrast, SSL-pretrained models like DINO excel at extracting rich visual features due to their specialized training paradigm. Yet, these SSL models require an additional supervised linear probing step, which relies on fully labeled data that is often expensive and difficult to obtain at scale. In this paper, we propose a label-free prompt-tuning method that leverages the rich visual features of self-supervised learning models (DINO) and the broad textual knowledge of large language models (LLMs) to substantially enhance CLIP-based image classification performance using unlabeled images. Our approach unfolds in three key steps: (1) We generate robust textual feature embeddings that more accurately represent object classes by leveraging class-specific descriptions from LLMs, enabling more effective zero-shot classification compared to CLIP's default name-specific prompts. (2) These textual embeddings are then used to produce pseudo-labels to train an alignment module that integrates the complementary strengths of LLM description-based textual embeddings and DINO's visual features. (3) Finally, we prompt-tune CLIP's vision encoder through DINO-assisted supervision using the trained alignment module. This three-step process allows us to harness the best of visual and textual foundation models, resulting in a powerful and efficient approach that surpasses state-of-the-art label-free classification methods. Notably, our framework, NoLA (No Labels Attached), achieves an average absolute gain of 3.6% over the state-of-the-art LaFTer across 11 diverse image classification datasets.

What problem does this paper attempt to address?

This paper addresses **how to improve the performance of zero-shot classifiers on image classification tasks when only unlabeled images are available**. Specifically, the authors propose a method named NoLA (No Labels Attached), which combines the strengths of self-supervised learning models (such as DINO) and large language models (LLMs) to improve CLIP's zero-shot classification performance using unlabeled image collections.

#### Main problem background

1. **Limitations of CLIP**:
   - CLIP is a powerful multimodal model that aligns text and image modalities in a common embedding space through contrastive learning. However, its alignment objective often yields subpar visual features for fine-grained tasks.
   - Although CLIP performs well on zero-shot classification, closed-set tasks in specific domains still require further supervised fine-tuning to match the performance of traditional methods.
2. **Advantages and limitations of self-supervised learning models**:
   - Self-supervised learning (SSL) pretrained models (such as DINO) excel at extracting rich visual features, but they usually require an additional supervised linear probing step that relies on fully labeled data, which is often expensive and difficult to obtain.
3. **Use of unlabeled data**:
   - Obtaining large amounts of labeled data is often both expensive and difficult, so fine-tuning models effectively with unlabeled data has become an important research direction.

#### The method proposed by NoLA

NoLA addresses these problems in three key steps:

1. **Generate robust text feature embeddings**:
   - Use large language models (LLMs) to generate class-specific descriptions, and build from them text feature embeddings that represent object classes more accurately. Compared with CLIP's default name-specific prompts, these embeddings enable more effective zero-shot classification.
2. **Pseudo-label generation and alignment module training**:
   - Use the generated text embeddings to produce pseudo-labels for the unlabeled training images, and train an alignment module that aligns the visual features of the self-supervised model (such as DINO) with the joint embedding space of the vision-language model (VLM).
3. **DINO-assisted vision encoder prompt tuning**:
   - Using the trained alignment module as DINO-assisted supervision, prompt-tune CLIP's vision encoder, which yields a lightweight adaptation of the vision encoder.

#### Main contributions

- Propose NoLA, a lightweight, label-free fine-tuning method for vision-language models on classification tasks.
- Use the knowledge base of large language models to generate class description embeddings (CDE), and use them to construct a DINO-based labeling network (DL).
- Evaluate on 11 widely recognized image classification datasets; the results show that NoLA improves on existing label-free methods, with an average absolute gain of 3.6% over the state-of-the-art LaFTer.

Through these contributions, NoLA improves zero-shot classification performance while reducing the dependence on expensive labeled data, which makes the model usable in a wider range of application scenarios.
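The first two steps can be illustrated with a small NumPy sketch. It is not the authors' implementation: the synthetic vectors stand in for real CLIP and DINO features, the function names (`class_description_embeddings`, `pseudo_labels`, `train_alignment_head`, `run_toy_example`) are hypothetical, averaging the description embeddings per class is an assumed way to pool them, and a plain linear softmax head is assumed in place of the paper's alignment module. Step 3, prompt-tuning CLIP's vision encoder, is not shown.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def class_description_embeddings(desc_embs_per_class):
    """Pool each class's description embeddings into one unit-norm classifier row.

    Assumption: mean pooling of normalized description embeddings.
    """
    rows = [l2_normalize(l2_normalize(np.asarray(e)).mean(axis=0))
            for e in desc_embs_per_class]
    return np.stack(rows)  # shape: (num_classes, dim)

def pseudo_labels(image_embs, class_embs):
    """Zero-shot prediction: arg-max cosine similarity to the class embeddings."""
    return (l2_normalize(image_embs) @ class_embs.T).argmax(axis=1)

def train_alignment_head(features, labels, num_classes, lr=0.5, epochs=200):
    """Fit a linear softmax head on frozen features using pseudo-labels.

    Assumption: full-batch gradient descent on cross-entropy, a stand-in for
    the alignment module described in the text.
    """
    n, d = features.shape
    W = np.zeros((d, num_classes))
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ W
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        W -= lr * features.T @ (probs - onehot) / n  # cross-entropy gradient
    return W

def run_toy_example(seed=0):
    """Synthetic stand-ins: 3 classes, 16-d 'CLIP' space, 32-d 'DINO' space."""
    rng = np.random.default_rng(seed)
    num_classes, clip_dim, dino_dim = 3, 16, 32
    true = np.arange(60) % num_classes  # hidden labels, used only to check the demo
    clip_proto = 3.0 * np.eye(num_classes, clip_dim)
    dino_proto = 3.0 * np.eye(num_classes, dino_dim)
    # Five noisy "description" embeddings per class.
    descs = [clip_proto[k] + 0.1 * rng.normal(size=(5, clip_dim))
             for k in range(num_classes)]
    clip_img = clip_proto[true] + 0.1 * rng.normal(size=(60, clip_dim))
    dino_img = l2_normalize(dino_proto[true] + 0.1 * rng.normal(size=(60, dino_dim)))

    class_embs = class_description_embeddings(descs)                # step 1
    pseudo = pseudo_labels(clip_img, class_embs)                    # step 2a
    W = train_alignment_head(dino_img, pseudo, num_classes)         # step 2b
    head_agreement = float(((dino_img @ W).argmax(axis=1) == pseudo).mean())
    return {"class_embs": class_embs, "pseudo": pseudo, "true": true,
            "head_agreement": head_agreement}

if __name__ == "__main__":
    out = run_toy_example()
    print("head agreement with pseudo-labels:", out["head_agreement"])
```

In a real pipeline, the description embeddings would come from CLIP's text encoder applied to LLM-generated class descriptions, and the two image feature sets from CLIP's and DINO's frozen vision encoders on the same unlabeled images.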

CLIP meets DINO for Tuning Zero-Shot Classifier using Unlabeled Image Collections

CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation

LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts

CLIPCleaner: Cleaning Noisy Labels with CLIP

LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and Unlabeled Image Collections

AutoCLIP: Auto-tuning Zero-Shot Classifiers for Vision-Language Models

CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement

From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models

Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning

TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without Training

RankCLIP: Ranking-Consistent Language-Image Pretraining

Pre-Trained Vision-Language Models as Partial Annotators

The Solution for Language-Enhanced Image New Category Discovery

Learning with noisy labels using collaborative sample selection and contrastive semi-supervised learning

CLIP-Decoder : ZeroShot Multilabel Classification using Multimodal CLIP Aligned Representation

Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation

TagCLIP: Improving Discrimination Ability of Zero-Shot Semantic Segmentation

CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention

Long-CLIP: Unlocking the Long-Text Capability of CLIP

SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference