Diffusion Feedback Helps CLIP See Better

Wenxuan Wang, Quan Sun, Fan Zhang, Yepeng Tang, Jing Liu, Xinlong Wang

2024-08-24

Abstract: Contrastive Language-Image Pre-training (CLIP), which excels at abstracting open-world representations across domains and modalities, has become a foundation for a variety of vision and multimodal tasks. However, recent studies reveal that CLIP has severe visual shortcomings: for example, it can hardly distinguish orientation, quantity, color, structure, etc. These visual shortcomings also limit the perception capabilities of multimodal large language models (MLLMs) built on CLIP. The main reason could be that the image-text pairs used to train CLIP are inherently biased, due to the lack of distinctiveness in the text and of diversity in the images. In this work, we present a simple post-training approach for CLIP models, which largely overcomes their visual shortcomings via a self-supervised diffusion process. We introduce DIVA, which uses the DIffusion model as a Visual Assistant for CLIP. Specifically, DIVA leverages generative feedback from text-to-image diffusion models to optimize CLIP representations, with only images (without corresponding text). We demonstrate that DIVA substantially improves CLIP's performance (e.g., by 3-7%) on the challenging MMVP-VLM benchmark, which assesses fine-grained visual abilities, and enhances the performance of MLLMs and vision models on multimodal understanding and segmentation tasks. Extensive evaluation on 29 image classification and retrieval benchmarks confirms that our framework preserves CLIP's strong zero-shot capabilities. The code is available at https://github.com/baaivision/DIVA.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses CLIP's weakness in perceiving fine-grained visual details. Although CLIP performs well across a wide range of vision and multimodal tasks, it struggles to distinguish details such as orientation, quantity, color, and structure. These deficiencies carry over to multimodal large language models (MLLMs) built on CLIP and limit their perceptual capabilities. The abstract attributes the problem mainly to bias in the image-text pairs used to train CLIP: the text is not distinctive enough and the images are not diverse enough.

To address this, the paper proposes DIVA (DIffusion model as a Visual Assistant for CLIP), a self-supervised post-training framework that uses generative feedback from a text-to-image diffusion model to optimize CLIP's visual representations. DIVA needs only images, with no corresponding text. The paper reports that it improves CLIP's performance on the fine-grained MMVP-VLM benchmark (e.g., by 3-7%) and enhances MLLMs and vision models on multimodal understanding and segmentation tasks. Evaluation on 29 image classification and retrieval benchmarks indicates that CLIP's strong zero-shot capabilities are preserved.
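The idea of generative feedback can be illustrated with a toy sketch. This is not the authors' implementation: it assumes that the feedback signal is a standard denoising (noise-prediction) loss, that the diffusion model stays frozen, and that only the encoder is updated. Small linear maps stand in for both networks, and every name here (`feedback_step`, `W`, `A`, `B`) is hypothetical.

```python
import math

def matvec(M, v):
    # Multiply matrix M (list of rows) by vector v.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def add_noise(x0, eps, alpha_bar):
    # DDPM-style forward process: x_t = sqrt(alpha_bar)*x0 + sqrt(1-alpha_bar)*eps
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * xi + b * ei for xi, ei in zip(x0, eps)]

def denoiser(A, B, x_t, cond):
    # Frozen linear stand-in for a diffusion model: eps_hat = A x_t + B cond
    return [p + q for p, q in zip(matvec(A, x_t), matvec(B, cond))]

def feedback_step(W, A, B, x0, eps, alpha_bar, lr):
    """Update encoder weights W in place; A and B are never modified.

    Returns the denoising loss measured before the update.
    """
    cond = matvec(W, x0)                 # encoder features used as the condition
    x_t = add_noise(x0, eps, alpha_bar)  # noised image
    eps_hat = denoiser(A, B, x_t, cond)  # conditional noise prediction
    resid = [eh - e for eh, e in zip(eps_hat, eps)]
    d = len(resid)
    loss = sum(r * r for r in resid) / d
    # Backpropagate by hand: dL/dcond = (2/d) * B^T resid
    g_cond = [2.0 / d * sum(B[i][k] * resid[i] for i in range(d))
              for k in range(len(cond))]
    # dL/dW[k][j] = g_cond[k] * x0[j]; gradient step on the encoder only
    for k in range(len(W)):
        for j in range(len(x0)):
            W[k][j] -= lr * g_cond[k] * x0[j]
    return loss

if __name__ == "__main__":
    W = [[0.1, -0.2]]             # trainable "encoder" (1 feature, 2 "pixels")
    A = [[0.5, 0.0], [0.0, 0.5]]  # frozen denoiser weights
    B = [[1.0], [0.5]]            # frozen conditioning weights
    x0, eps = [1.0, 2.0], [0.3, -0.4]
    for step in range(3):
        print(step, feedback_step(W, A, B, x0, eps, alpha_bar=0.5, lr=0.05))
```

In this sketch the loss depends on the image alone, so no caption is needed, which matches the image-only setting described above. In the real setting the condition would come from CLIP's visual features and the denoiser would be a pretrained text-to-image diffusion model; how exactly those features are injected is not specified in this summary.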
