Abstract: Recent progress in image generation has sparked research into controlling these models through condition signals, with various methods addressing specific challenges in conditional generation. Instead of proposing another specialized technique, we introduce a simple, unified framework to handle diverse conditional generation tasks involving a specific image-condition correlation. By learning a joint distribution over a correlated image pair (e.g. image and depth) with a diffusion model, our approach enables versatile capabilities via different inference-time sampling schemes, including controllable image generation (e.g. depth to image), estimation (e.g. image to depth), signal guidance, joint generation (image & depth), and coarse control. Previous attempts at unification often introduce significant complexity through multi-stage training, architectural modification, or increased parameter counts. In contrast, our simple formulation requires a single, computationally efficient training stage, maintains the standard model input, and adds minimal learned parameters (15% of the base model). Moreover, our model supports additional capabilities like non-spatially aligned and coarse conditioning. Extensive results show that our single model can produce comparable results with specialized methods and better results than prior unified methods. We also demonstrate that multiple models can be effectively combined for multi-signal conditional generation.

What problem does this paper attempt to address?

The problem this paper addresses is that existing conditional generation methods usually require complex multi-stage training, architectural modification, or additional parameters to handle each type of image-condition pair. These methods can solve specific conditional generation tasks, but they lack flexibility and generality, and they struggle to support several such tasks at once. The paper proposes UniCon, a unified diffusion framework that handles diverse conditional generation tasks by learning the joint distribution of an image and its condition. These tasks include, but are not limited to:

1. **Controllable image generation**: For example, generating an image from a depth map (depth-to-image) or text inpainting using an edge sketch.
2. **Estimation**: For example, generating a depth map from an image (image-to-depth).
3. **Joint generation**: Generating an image and its corresponding depth map at the same time.
4. **Coarse control**: Generating from imprecise or partial condition signals, such as a coarse depth map.

With a simple, efficient training strategy and flexible sampling schemes, UniCon performs all of the above tasks in a single model without significantly increasing model complexity or parameter count. Compared with previous unified methods, UniCon has the following advantages:

- **Simplified training process**: A single, computationally efficient training stage is required; the standard model input is kept, and only a small number of learned parameters are added (about 15% of the base model).
- **Flexible conditional generation**: Condition signals that are coarse or not spatially aligned with the image are supported.
- **Better performance**: Experiments show that the single UniCon model produces results comparable to specialized methods and better than prior unified methods.

In summary, the goal of this paper is a general and efficient solution that lets a single model flexibly handle multiple conditional generation tasks.
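The idea that one jointly trained model can serve several tasks purely through its sampling scheme can be illustrated with a toy sketch. It assumes, which the summary above does not state, that the sampler may assign independent noise levels to the image and to the condition. The function `noise_schedule`, the task names, and the fixed 0.5 level used for coarse control are hypothetical choices for illustration, not UniCon's actual interface.

```python
def noise_schedule(task, num_steps=4):
    """Return (image_levels, condition_levels) for a hypothetical joint sampler.

    A level of 1.0 means pure noise and 0.0 means a clean, given signal.
    Everything here is an illustrative assumption, not the paper's API.
    """
    # Descending noise levels in (0, 1], one per sampling step.
    levels = [1.0 - i / num_steps for i in range(num_steps)]
    clean = [0.0] * num_steps

    if task == "joint":
        # Both signals start as noise and are denoised together.
        return levels, list(levels)
    if task == "condition_to_image":
        # The condition (e.g. depth) is given, so it stays clean.
        return levels, clean
    if task == "image_to_condition":
        # The image is given; only the condition is denoised.
        return clean, levels
    if task == "coarse_control":
        # Assumed scheme: keep the condition partially noised so that
        # it constrains the image only loosely.
        return levels, [0.5] * num_steps
    raise ValueError(f"unknown task: {task}")


if __name__ == "__main__":
    for name in ("joint", "condition_to_image",
                 "image_to_condition", "coarse_control"):
        print(name, noise_schedule(name))
```

Under these assumptions, switching tasks changes only which signal is held clean, which is denoised, and how strongly the condition is trusted; the model weights stay the same.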

A Simple Approach to Unifying Diffusion-based Conditional Generation
