Bridging Generative and Discriminative Models for Unified Visual Perception with Diffusion Priors

Shiyin Dong, Mingrui Zhu, Kun Cheng, Nannan Wang, Xinbo Gao
2024-01-29
Abstract: The remarkable prowess of diffusion models in image generation has spurred efforts to extend their application beyond generative tasks. However, a persistent challenge exists in lacking a unified approach to apply diffusion models to visual perception tasks with diverse semantic granularity requirements. Our purpose is to establish a unified visual perception framework, capitalizing on the potential synergies between generative and discriminative models. In this paper, we propose Vermouth, a simple yet effective framework comprising a pre-trained Stable Diffusion (SD) model containing rich generative priors, a unified head (U-head) capable of integrating hierarchical representations, and an adapted expert providing discriminative priors. Comprehensive investigations unveil potential characteristics of Vermouth, such as varying granularity of perception concealed in latent variables at distinct time steps and various U-net stages. We emphasize that there is no necessity for incorporating a heavyweight or intricate decoder to transform diffusion models into potent representation learners. Extensive comparative evaluations against tailored discriminative models showcase the efficacy of our approach on zero-shot sketch-based image retrieval (ZS-SBIR), few-shot classification, and open-vocabulary semantic segmentation tasks. The promising results demonstrate the potential of diffusion models as formidable learners, establishing their significance in furnishing informative and robust visual representations.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the challenges of applying diffusion models to visual perception tasks. Specifically, the authors focus on:

1. **Lack of a unified method**: No unified method currently exists for applying diffusion models to visual perception tasks with different semantic granularity requirements.
2. **Differences in semantic granularity**: Visual perception tasks need to establish different decision boundaries \( p_\theta(y|x) \), which the original design of diffusion models did not consider.
3. **Internal prior knowledge extraction**: How to extract latent knowledge from diffusion models and use it for non-generative tasks (such as classification, image retrieval, and segmentation) remains an open problem.

To solve these problems, the authors propose a framework named Vermouth, which combines a pre-trained Stable Diffusion (SD) model, a unified head (U-head), and an adapted expert to support tasks with different semantic granularity requirements.

#### Specific objectives

- **Establish a unified framework**: Propose a single framework that applies diffusion models to various visual perception tasks without a task-specific design for each one.
- **Fuse generative and discriminative models**: Improve performance on visual perception tasks by using the U-head to fuse the prior knowledge of generative models with that of discriminative models.
- **Explore the influence of time steps**: Study how different time steps affect feature extraction in order to optimize model performance.

#### Main contributions

- Proposed the first unified framework for applying diffusion models to visual perception tasks with different semantic granularity requirements.
- Designed a unified head (U-head) that fuses the generative prior of the SD model with the discriminative prior of the adapted expert.
- Through experimental analysis, revealed the influence of hyper-parameters (such as noise level) on model performance, providing insights for subsequent research.

### Formula summary

- Forward step of the diffusion process:
  \[ z_t \sim q(z_t|z_{t-1}) = \mathcal{N}(z_t; \sqrt{1-\beta_t}\, z_{t-1}, \beta_t I) \]
- Expression after re-parameterization:
  \[ q(z_t|z_0) = \mathcal{N}(z_t; \sqrt{\bar{\alpha}_t}\, z_0, (1-\bar{\alpha}_t) I) \]
  where
  \[ \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s = \prod_{s=1}^{t} (1-\beta_s) \]
- Reverse (sampling) process of the diffusion model:
  \[ p(z_{0:T}) = p(z_T) \prod_{t=1}^{T} p(z_{t-1}|z_t) \]
- Simple loss function:
  \[ L_{\text{simple}} = \mathbb{E}_{z_t, \epsilon \sim \mathcal{N}(0, I)} \left[ \|\epsilon - \epsilon_\theta(z_t; t, c)\|_2^2 \right] \]
- Alignment with text features (here \( v \) and \( t \) presumably denote the visual and text feature vectors; this \( t \) is distinct from the time step \( t \) above):
  \[ p(y|x) = \frac{v \cdot t}{\|v\| \cdot \|t\|} \]

Together, these formulations underpin the authors' approach to extending the strengths of diffusion models to visual perception tasks.
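The formulas in the summary above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the linear noise schedule, its default values, and every function name (`linear_betas`, `cumulative_alphas`, `q_sample`, `simple_loss`, `cosine_similarity`) are assumptions made for illustration, and plain Python lists stand in for latent tensors and feature vectors.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear noise schedule (an assumption, not taken from the paper)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = prod_{s=1}^{t} (1 - beta_s); list index 0 holds t = 1."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(z0, t, alpha_bars, rng):
    """Draw z_t ~ q(z_t | z_0) directly from z_0; t is 1-indexed."""
    a_bar = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    z_t = [math.sqrt(a_bar) * z + math.sqrt(1.0 - a_bar) * e
           for z, e in zip(z0, eps)]
    return z_t, eps


def simple_loss(eps, eps_pred):
    """Squared L2 distance between the true and predicted noise for one sample."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


def cosine_similarity(v, t):
    """v . t / (||v|| ||t||) for a visual feature v and a text feature t."""
    dot = sum(a * b for a, b in zip(v, t))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_t = math.sqrt(sum(b * b for b in t))
    return dot / (norm_v * norm_t)


if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bars = cumulative_alphas(linear_betas(1000))
    z0 = [0.5, -1.0, 2.0]  # toy stand-in for a latent z_0
    z_t, eps = q_sample(z0, t=500, alpha_bars=alpha_bars, rng=rng)
    print("z_t:", z_t)
    print("loss for a perfect noise prediction:", simple_loss(eps, eps))
    print("cosine similarity:", cosine_similarity([1.0, 2.0], [2.0, 1.0]))
```

Because `q_sample` uses the closed form \( q(z_t|z_0) \), a noisy latent at any time step is obtained in one draw rather than by iterating the per-step transition, which is what makes it cheap to probe features at different noise levels.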