Abstract: The remarkable prowess of diffusion models in image generation has spurred efforts to extend their application beyond generative tasks. However, a persistent challenge exists in lacking a unified approach to apply diffusion models to visual perception tasks with diverse semantic granularity requirements. Our purpose is to establish a unified visual perception framework, capitalizing on the potential synergies between generative and discriminative models. In this paper, we propose Vermouth, a simple yet effective framework comprising a pre-trained Stable Diffusion (SD) model containing rich generative priors, a unified head (U-head) capable of integrating hierarchical representations, and an adapted expert providing discriminative priors. Comprehensive investigations unveil potential characteristics of Vermouth, such as varying granularity of perception concealed in latent variables at distinct time steps and various U-net stages. We emphasize that there is no necessity for incorporating a heavyweight or intricate decoder to transform diffusion models into potent representation learners. Extensive comparative evaluations against tailored discriminative models showcase the efficacy of our approach on zero-shot sketch-based image retrieval (ZS-SBIR), few-shot classification, and open-vocabulary semantic segmentation tasks. The promising results demonstrate the potential of diffusion models as formidable learners, establishing their significance in furnishing informative and robust visual representations.

### What problems does this paper attempt to solve?

This paper addresses the challenges of applying diffusion models to visual perception tasks. Specifically, the authors focus on:

1. **Lack of a unified method**: No unified method currently exists for applying diffusion models to visual perception tasks with different semantic granularity requirements.
2. **Differences in semantic granularity**: Visual perception tasks need to establish different decision boundaries \( p_\theta(y|x) \), which the original design of diffusion models did not consider.
3. **Internal prior knowledge extraction**: How to extract latent knowledge from diffusion models and use it for non-generative tasks (such as classification, image retrieval, and segmentation) remains an unsolved problem.

To solve these problems, the authors propose a framework named Vermouth, which combines a pre-trained Stable Diffusion (SD) model, a unified head (U-head), and an adapted expert to support tasks with different semantic granularity requirements.

#### Specific objectives

- **Establish a unified framework**: Propose a unified framework that applies diffusion models to various visual perception tasks without a task-specific design for each task.
- **Fuse generative and discriminative models**: Improve performance on visual perception tasks by introducing a unified head (U-head) that fuses the prior knowledge of generative models with that of discriminative models.
- **Explore the influence of time steps**: Study how different time steps affect feature extraction in order to optimize model performance.

#### Main contributions

- Proposed the first unified framework for applying diffusion models to visual perception tasks with different semantic granularity requirements.
- Designed a unified head (U-head) that can effectively fuse the generative prior of the SD model with the discriminative prior.
- Through experimental analysis, revealed the influence of hyper-parameters (such as noise level) on model performance, providing insights for subsequent research.

### Formula summary

- Forward step of the diffusion process:

\[ z_t \sim q(z_t|z_{t-1})=\mathcal{N}(\sqrt{1-\beta_t}\,z_{t-1},\beta_t I) \]

- Expression after re-parameterization:

\[ q(z_t|z_0)=\mathcal{N}(z_t;\sqrt{\bar{\alpha}_t}\,z_0,(1-\bar{\alpha}_t)I) \]

where

\[ \bar{\alpha}_t=\prod_{s=1}^{t}\alpha_s=\prod_{s=1}^{t}(1-\beta_s) \]

- Sampling process of the diffusion model:

\[ p(z_{0:T})=p(z_T)\prod_{t=1}^{T}p(z_{t-1}|z_t) \]

- Simple loss function:

\[ L_{\text{simple}}=\mathbb{E}_{z_t,\epsilon\sim\mathcal{N}(0,1)}[\|\epsilon-\epsilon_\theta(z_t;t,c)\|^2_2] \]

- Aligning text features:

\[ p(y|x)=\frac{v\cdot t}{\|v\|\cdot\|t\|} \]

With these formulas and methods, the authors demonstrate how to extend the advantages of diffusion models to visual perception tasks, offering a new research direction and technical means.
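The formulas in the summary can be made concrete with a small numerical sketch. The code below is not the authors' implementation: the linear schedule, its endpoint values, the toy latent, and all function names are assumptions chosen for illustration. It uses only the Python standard library and shows the closed-form draw from \( q(z_t|z_0) \), the per-sample squared error inside \( L_{\text{simple}} \), and the cosine similarity used to align a visual feature \( v \) with a text feature \( t \).

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Hypothetical linear schedule beta_1..beta_T (endpoint values are assumptions)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bar(betas, t):
    """bar(alpha)_t = prod_{s=1}^{t} (1 - beta_s), with t counted from 1."""
    prod = 1.0
    for beta in betas[:t]:
        prod *= 1.0 - beta
    return prod


def q_sample(z0, t, betas, rng):
    """Draw z_t ~ q(z_t | z_0) in closed form; returns (z_t, eps)."""
    a_bar = alpha_bar(betas, t)
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(a_bar) * z + math.sqrt(1.0 - a_bar) * e
          for z, e in zip(z0, eps)]
    return zt, eps


def simple_loss(eps, eps_pred):
    """Squared L2 distance between true and predicted noise for one sample."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


def cosine_similarity(v, t):
    """(v . t) / (||v|| ||t||), the score used to align visual and text features."""
    dot = sum(a * b for a, b in zip(v, t))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_t = math.sqrt(sum(b * b for b in t))
    return dot / (norm_v * norm_t)


rng = random.Random(0)
betas = linear_beta_schedule(1000)
z0 = [0.5, -1.0, 2.0]  # toy latent, not a real SD latent
zt, eps = q_sample(z0, t=500, betas=betas, rng=rng)

# A perfect noise predictor gives zero loss; an all-zero guess does not.
loss_perfect = simple_loss(eps, eps)
loss_zero = simple_loss(eps, [0.0] * len(eps))

# Toy 2-D features standing in for a visual embedding and a text embedding.
score = cosine_similarity([1.0, 0.0], [1.0, 1.0])
```

In a real pipeline, `eps_pred` would come from the denoising U-Net \( \epsilon_\theta(z_t;t,c) \), and the expectation in \( L_{\text{simple}} \) would be approximated by averaging `simple_loss` over sampled latents, time steps, and noise draws.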

Bridging Generative and Discriminative Models for Unified Visual Perception with Diffusion Priors

Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models

Unleashing Text-to-Image Diffusion Models for Visual Perception

Text-driven Visual Synthesis with Latent Diffusion Prior

Diffusion Models Need Visual Priors for Image Generation

Diffusion Models Trained with Large Data Are Transferable Visual Models

PriorDiffusion: Leverage Language Prior in Diffusion Models for Monocular Depth Estimation

Decoding Diffusion: A Scalable Framework for Unsupervised Analysis of Latent Space Biases and Representations Using Natural Language Prompts

Harnessing Diffusion Models for Visual Perception with Meta Prompts

What Matters When Repurposing Diffusion Models for General Dense Perception Tasks?

Diffusion Features to Bridge Domain Gap for Semantic Segmentation

Semantic Image Synthesis Via Diffusion Models

DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception

Unified Discrete Diffusion for Simultaneous Vision-Language Generation

Versatile Diffusion: Text, Images and Variations All in One Diffusion Model

Towards the Unification of Generative and Discriminative Visual Foundation Model: A Survey

DiffusionSeg: Adapting Diffusion Towards Unsupervised Object Discovery

Open-vocabulary Object Segmentation with Diffusion Models

Diffusion Models For Multi-Modal Generative Modeling

Diffusion Models in Low-Level Vision: A Survey