Abstract: In the era of foundation models, CLIP has emerged as a powerful tool for aligning text and visual modalities into a common embedding space. However, the alignment objective used to train CLIP often results in subpar visual features for fine-grained tasks. In contrast, models pretrained with self-supervised learning (SSL), such as DINO, excel at extracting rich visual features due to their specialized training paradigm. Yet these SSL models require an additional supervised linear probing step that relies on fully labeled data, which is often expensive and difficult to obtain at scale. In this paper, we propose a label-free prompt-tuning method that leverages the rich visual features of SSL models (DINO) and the broad textual knowledge of large language models (LLMs) to substantially enhance CLIP-based image classification performance using only unlabeled images. Our approach unfolds in three key steps: (1) We generate robust textual feature embeddings that more accurately represent object classes by leveraging class-specific descriptions from LLMs, enabling more effective zero-shot classification than CLIP's default name-specific prompts. (2) These textual embeddings are then used to produce pseudo-labels to train an alignment module that integrates the complementary strengths of LLM description-based textual embeddings and DINO's visual features. (3) Finally, we prompt-tune CLIP's vision encoder through DINO-assisted supervision using the trained alignment module. This three-step process allows us to harness the best of visual and textual foundation models, resulting in a powerful and efficient approach that surpasses state-of-the-art label-free classification methods. Notably, our framework, NoLA (No Labels Attached), achieves an average absolute gain of 3.6% over the state-of-the-art LaFter across 11 diverse image classification datasets.
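
The first two steps can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes that the text embeddings of LLM-generated descriptions and the CLIP and DINO image features have already been extracted as arrays, that a class embedding is the normalized mean of its description embeddings, and that the alignment module is a plain linear softmax head. All function names are hypothetical, and step (3), prompt-tuning CLIP's vision encoder, is not covered.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def class_text_embeddings(description_embeddings):
    """Step 1 (assumed form): one classifier vector per class, taken as the
    normalized mean of that class's description embeddings.
    description_embeddings: list of arrays, each of shape (n_desc, d)."""
    means = [l2_normalize(e).mean(axis=0) for e in description_embeddings]
    return l2_normalize(np.stack(means))

def pseudo_labels(image_features, class_embeddings):
    """Step 2a: zero-shot pseudo-labels from cosine similarity between
    image features and class text embeddings (same embedding space)."""
    logits = l2_normalize(image_features) @ class_embeddings.T
    return logits.argmax(axis=1)

def train_alignment_head(dino_features, labels, num_classes, lr=0.5, epochs=200):
    """Step 2b (assumed form): fit a linear softmax head on DINO features
    using the pseudo-labels, via full-batch gradient descent."""
    n, d = dino_features.shape
    weights = np.zeros((d, num_classes))
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = dino_features @ weights
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # Gradient of the mean cross-entropy loss with respect to the weights.
        weights -= lr * dino_features.T @ (probs - onehot) / n
    return weights
```

In this sketch, `pseudo_labels` would be called with CLIP image features, while `train_alignment_head` would receive DINO features of the same images, so the head learns to map DINO's visual features onto the classes defined by the description-based text embeddings.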

DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment

Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation

From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models

DINOv2: Learning Robust Visual Features without Supervision

CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation

CLIP meets DINO for Tuning Zero-Shot Classifier using Unlabeled Image Collections

A Progressive Framework of Vision-language Knowledge Distillation and Alignment for Multilingual Scene

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

Unified Lexical Representation for Interpretable Visual-Language Alignment

DenseDINO: Boosting Dense Self-Supervised Learning with Token-Based Point-Level Consistency

Finetuning CLIP to Reason about Pairwise Differences

From Unimodal to Multimodal: Scaling up Projectors to Align Modalities

UNIMO-2: End-to-End Unified Vision-Language Grounded Learning

LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models

PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining

Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation

Explaining Vision-Language Similarities in Dual Encoders with Feature-Pair Attributions

How Much Can CLIP Benefit Vision-and-Language Tasks?

Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation

Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Alignment