Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners

Xuehai He, Weixi Feng, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, William Yang Wang, Xin Eric Wang

2024-04-25

Abstract: Diffusion models, such as Stable Diffusion, have shown incredible performance on text-to-image generation. Since text-to-image generation often requires models to generate visual concepts with fine-grained details and attributes specified in text prompts, can we leverage the powerful representations learned by pre-trained diffusion models for discriminative tasks such as image-text matching? To answer this question, we propose a novel approach, Discriminative Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners. Our approach mainly uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information and fine-tune the model via efficient attention-based prompt learning to perform image-text matching. By comparing DSD with state-of-the-art methods on several benchmark datasets, we demonstrate the potential of using pre-trained diffusion models for discriminative tasks with superior results on few-shot image-text matching.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper primarily explores how to utilize pre-trained diffusion models (specifically Stable Diffusion) to solve image-text matching tasks and proposes a new method called **Discffusion**.

#### Main Research Questions:

1. **How to transform a powerful generative model into a discriminative model?**
   - The core question of the paper is whether the Stable Diffusion model, which performs excellently in generative tasks, can be used to complete discriminative tasks such as image-text matching.
2. **How to effectively adapt to situations with a small number of samples?**
   - In few-shot scenarios, how to enable the model to quickly adapt to new tasks and perform well in image-text matching.

#### Method Overview:

- **Cross-Attention Score Calculation**: Extracting the mutual influence between visual and textual information by calculating the cross-attention matrix in the Stable Diffusion model.
- **LogSumExp Pooling**: Aggregating these attention scores to obtain a single matching score.
- **Attention-Based Prompt Learning**: Updating the key and value mappings from text to latent features in few-shot settings, allowing the model to learn new image-text concepts while retaining the ability to capture complex relationships.

#### Experimental Results:

- On the Compositional Visual Genome and RefCOCOg datasets, Discffusion outperformed CLIP-based methods, with accuracy improvements of 5.4% and 9.3% respectively in few-shot settings.
- Additionally, this method demonstrated superior performance on the visual question answering task (VQAv2 dataset).

Through these experimental results, the paper shows that diffusion models not only perform excellently in generative tasks but also have broad application potential in discriminative tasks.
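
The LogSumExp pooling step named in the method overview can be illustrated with a small sketch. This is not the authors' implementation: the function names `logsumexp_pool` and `best_match`, the `temperature` parameter, and the toy attention values are all assumptions made for illustration, and the summary above does not say which attention layers or scaling the paper uses. The sketch only shows how a matrix of cross-attention scores (rows as image latent positions, columns as text tokens) might be collapsed into one matching score per image-text pair, and how candidate captions could then be ranked.

```python
import math


def logsumexp_pool(attention, temperature=1.0):
    """Collapse a 2D attention matrix into one scalar score.

    Computes (1 / t) * log(sum(exp(t * a))) over all entries a.
    `temperature` (t) is a hypothetical knob: larger values make the
    result approach the maximum entry, smaller values weight all
    entries more evenly.
    """
    scaled = [temperature * a for row in attention for a in row]
    peak = max(scaled)  # subtract the max for numerical stability
    total = sum(math.exp(x - peak) for x in scaled)
    return (peak + math.log(total)) / temperature


def best_match(candidate_attentions, temperature=1.0):
    """Return the index of the candidate text with the highest pooled score.

    `candidate_attentions` holds one attention matrix per candidate
    caption, all computed against the same image (assumed given).
    """
    scores = [logsumexp_pool(a, temperature) for a in candidate_attentions]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy attention matrices for two candidate captions (made-up numbers).
caption_a = [[0.1, 0.2], [0.2, 0.1]]
caption_b = [[0.1, 0.9], [0.3, 0.2]]
winner = best_match([caption_a, caption_b])
```

LogSumExp acts as a smooth maximum, so a few strongly attended image-token pairs can dominate the score without discarding the remaining entries entirely, which is one plausible reason to prefer it over plain mean or max pooling.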

DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability

Do text-free diffusion models learn discriminative visual representations?

Are Diffusion Models Vision-And-Language Reasoners?

Unleashing the Potential of the Diffusion Model in Few-shot Semantic Segmentation

Your Diffusion Model is Secretly a Zero-Shot Classifier

Unleashing Text-to-Image Diffusion Models for Visual Perception

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation

Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter

Diffusion-Adapter: Text Guided Image Manipulation with Frozen Diffusion Models

Diffusion in Diffusion: Cyclic One-Way Diffusion for Text-Vision-Conditioned Generation

DiffusionSeg: Adapting Diffusion Towards Unsupervised Object Discovery

Diffusion Model Alignment Using Direct Preference Optimization

From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models

De-Diffusion Makes Text a Strong Cross-Modal Interface

Contextualized Diffusion Models for Text-Guided Image and Video Generation

Diffusion Self-Distillation for Zero-Shot Customized Image Generation

LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?