Abstract: Diffusion models (DMs) have become a new trend in generative modeling and have demonstrated a powerful capacity for conditional synthesis. Among them, text-to-image diffusion models pre-trained on large-scale image-text pairs are highly controllable through customizable prompts. Unlike unconditional generative models, which focus on low-level attributes and details, text-to-image diffusion models contain more high-level knowledge thanks to vision-language pre-training. In this paper, we propose VPD (Visual Perception with a pre-trained Diffusion model), a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks. Instead of using the pre-trained denoising autoencoder in a diffusion-based pipeline, we simply use it as a backbone and study how to take full advantage of the learned knowledge. Specifically, we prompt the denoising decoder with proper textual inputs and refine the text features with an adapter, leading to better alignment with the pre-training stage and making the visual contents interact with the text prompts. We also propose to utilize the cross-attention maps between the visual features and the text features to provide explicit guidance. Compared with other pre-training methods, we show that vision-language pre-trained diffusion models can be adapted faster to downstream visual perception tasks using the proposed VPD. Extensive experiments on semantic segmentation, referring image segmentation, and depth estimation demonstrate the effectiveness of our method. Notably, VPD attains 0.254 RMSE on NYUv2 depth estimation and 73.3% oIoU on RefCOCO-val referring image segmentation, establishing new records on these two benchmarks. Code is available at https://github.com/wl-zhao/VPD

From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models

Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation

DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models

Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis

Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models

Towards Better Text-to-Image Generation Alignment via Attention Modulation

Unleashing Text-to-Image Diffusion Models for Visual Perception

Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis

Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors

Text-image Alignment for Diffusion-based Perception

Open-vocabulary Object Segmentation with Diffusion Models

SPDiffusion: Semantic Protection Diffusion for Multi-concept Text-to-image Generation

Local Conditional Controlling for Text-to-Image Diffusion Models

Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models

RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment

FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models

Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model

Mask-ControlNet: Higher-Quality Image Generation with An Additional Mask Prompt