Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners

Xuehai He, Weixi Feng, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, William Yang Wang, Xin Eric Wang
2024-04-25
Abstract: Diffusion models, such as Stable Diffusion, have shown incredible performance on text-to-image generation. Since text-to-image generation often requires models to generate visual concepts with fine-grained details and attributes specified in text prompts, can we leverage the powerful representations learned by pre-trained diffusion models for discriminative tasks such as image-text matching? To answer this question, we propose a novel approach, Discriminative Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners. Our approach mainly uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information and fine-tunes the model via efficient attention-based prompt learning to perform image-text matching. By comparing DSD with state-of-the-art methods on several benchmark datasets, we demonstrate the potential of using pre-trained diffusion models for discriminative tasks with superior results on few-shot image-text matching.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper primarily explores how to utilize pre-trained diffusion models (specifically Stable Diffusion) to solve image-text matching tasks and proposes a new method called **Discffusion**.

#### Main Research Questions:

1. **How to transform a powerful generative model into a discriminative model?**
   - The core question of the paper is whether the Stable Diffusion model, which performs excellently in generative tasks, can be used to complete discriminative tasks such as image-text matching.
2. **How to effectively adapt to situations with a small number of samples?**
   - In few-shot scenarios, how to enable the model to quickly adapt to new tasks and perform well in image-text matching.

#### Method Overview:

- **Cross-Attention Score Calculation**: Extracting the mutual influence between visual and textual information by calculating the cross-attention matrix in the Stable Diffusion model.
- **LogSumExp Pooling**: Aggregating these attention scores to obtain a single matching score.
- **Attention-Based Prompt Learning**: Updating the key and value mappings from text to latent features in few-shot settings, allowing the model to learn new image-text concepts while retaining the ability to capture complex relationships.

#### Experimental Results:

- On the Compositional Visual Genome and RefCOCOg datasets, Discffusion outperformed CLIP-based methods, with accuracy improvements of 5.4% and 9.3% respectively in few-shot settings.
- Additionally, this method demonstrated superior performance on the visual question answering task (VQAv2 dataset).

Through these experimental results, the paper shows that diffusion models not only perform excellently in generative tasks but also have broad application potential in discriminative tasks.
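
The first two steps of the method overview (computing a cross-attention map, then pooling it with LogSumExp into one matching score) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the projection matrices `w_q` and `w_k`, the array shapes, and the `temperature` parameter are all assumptions made for illustration, and a real Stable Diffusion model would supply the attention maps from its U-Net layers.

```python
import numpy as np


def cross_attention_scores(latent_feats, text_feats, w_q, w_k):
    """Softmax-normalized attention of image latents (queries) over text tokens (keys).

    Hypothetical shapes: latent_feats (num_patches, d_img), text_feats (num_tokens, d_txt),
    w_q (d_img, d), w_k (d_txt, d). Returns an array of shape (num_patches, num_tokens).
    """
    q = latent_feats @ w_q
    k = text_feats @ w_k
    logits = q @ k.T / np.sqrt(q.shape[-1])
    # Subtract the row-wise max before exponentiating for numerical stability.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)


def logsumexp_pool(scores, temperature=1.0):
    """Collapse all attention scores into one scalar with a stable LogSumExp.

    `temperature` is an illustrative assumption; smaller values push the result
    toward the maximum score, larger values toward a shifted average.
    """
    x = np.asarray(scores, dtype=np.float64).ravel() / temperature
    m = x.max()
    return temperature * (m + np.log(np.exp(x - m).sum()))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latents = rng.normal(size=(16, 32))  # stand-in for image latent features
    tokens = rng.normal(size=(7, 24))    # stand-in for text token embeddings
    w_q = rng.normal(size=(32, 8))
    w_k = rng.normal(size=(24, 8))
    attn = cross_attention_scores(latents, tokens, w_q, w_k)
    print("matching score:", logsumexp_pool(attn))
```

LogSumExp acts as a smooth maximum, so strongly attended patch-token pairs dominate the pooled score while weaker ones still contribute. In a matching setup, one would compute this score for each candidate image-text pair and pick the highest.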