A Simple Approach to Unifying Diffusion-based Conditional Generation

Xirui Li, Charles Herrmann, Kelvin C.K. Chan, Yinxiao Li, Deqing Sun, Chao Ma, Ming-Hsuan Yang
2024-10-15
Abstract: Recent progress in image generation has sparked research into controlling these models through condition signals, with various methods addressing specific challenges in conditional generation. Instead of proposing another specialized technique, we introduce a simple, unified framework to handle diverse conditional generation tasks involving a specific image-condition correlation. By learning a joint distribution over a correlated image pair (e.g. image and depth) with a diffusion model, our approach enables versatile capabilities via different inference-time sampling schemes, including controllable image generation (e.g. depth to image), estimation (e.g. image to depth), signal guidance, joint generation (image & depth), and coarse control. Previous attempts at unification often introduce significant complexity through multi-stage training, architectural modification, or increased parameter counts. In contrast, our simple formulation requires a single, computationally efficient training stage, maintains the standard model input, and adds minimal learned parameters (15% of the base model). Moreover, our model supports additional capabilities like non-spatially aligned and coarse conditioning. Extensive results show that our single model can produce comparable results with specialized methods and better results than prior unified methods. We also demonstrate that multiple models can be effectively combined for multi-signal conditional generation.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
Existing conditional generation models usually need complex multi-stage training, architectural modifications, or more parameters to handle each type of image-condition pair. These methods solve their individual tasks, but they are not flexible or general enough to support several conditional generation tasks at once.

This paper proposes UniCon, a unified diffusion framework that handles diverse conditional generation tasks by learning the joint distribution of an image and its condition. These tasks include, but are not limited to:

1. **Controllable image generation**: for example, generating an image from a depth map (depth-to-image) or text-guided inpainting from an edge sketch.
2. **Estimation**: for example, predicting a depth map from an image (image-to-depth).
3. **Joint generation**: generating an image and its corresponding depth map together.
4. **Coarse control**: generating from imprecise or partial condition signals, such as a coarse depth map.

With a simple, efficient training strategy and flexible inference-time sampling, UniCon covers all of these tasks in a single model without significantly increasing model complexity or parameter count. Compared with previous unified methods, UniCon has the following advantages:

- **Simplified training**: a single, computationally efficient training stage that keeps the standard model input and adds only a small number of learned parameters (about 15% of the base model).
- **Flexible conditioning**: support for non-spatially-aligned and coarse condition signals.
- **Better performance**: across multiple modalities, results are comparable to specialized methods and better than prior unified methods.

In summary, the paper aims to provide a general and efficient solution that lets a single model flexibly handle multiple conditional generation tasks.
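The abstract states that one jointly trained model covers these tasks through different inference-time sampling schemes, but this summary does not describe the schemes themselves. The sketch below illustrates one plausible reading, assumed here for illustration only: the image and the condition each get their own noise level, and the task is chosen by which side starts clean and which side is denoised. The function name `timestep_schedule`, the task labels, and the values `t_max` and `coarse_level` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def timestep_schedule(task: str, num_steps: int = 4, t_max: int = 1000,
                      coarse_level: int = 500) -> List[Tuple[int, int]]:
    """Return (t_image, t_condition) noise levels for each sampling step.

    Illustrative assumption: a noise level of 0 means "given and clean",
    a decreasing level means "being generated by the sampler".
    """
    # Linearly decreasing noise levels, starting at t_max.
    descending = [t_max - i * t_max // num_steps for i in range(num_steps)]
    clean = [0] * num_steps

    if task == "condition_to_image":   # e.g. depth to image
        return list(zip(descending, clean))
    if task == "image_to_condition":   # e.g. image to depth
        return list(zip(clean, descending))
    if task == "joint":                # image and depth generated together
        return list(zip(descending, descending))
    if task == "coarse_control":
        # Hypothetical: hold the condition at a fixed intermediate noise
        # level so that only its coarse structure constrains the image.
        return list(zip(descending, [coarse_level] * num_steps))
    raise ValueError(f"unknown task: {task}")


if __name__ == "__main__":
    for name in ("condition_to_image", "image_to_condition",
                 "joint", "coarse_control"):
        print(name, timestep_schedule(name))
```

In this sketch a single denoiser would be called with both noise levels at every step, so switching tasks changes only the schedule and not the model. How UniCon actually parameterizes and samples these schedules should be checked against the paper.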