OmniControlNet: Dual-stage Integration for Conditional Image Generation

Yilin Wang, Haiyang Xu, Xiang Zhang, Zeyuan Chen, Zhizhou Sha, Zirui Wang, Zhuowen Tu
2024-06-10
Abstract: We provide a two-way integration for the widely adopted ControlNet by integrating external condition generation algorithms into a single dense prediction method and incorporating its individually trained image generation processes into a single model. Despite its tremendous success, the ControlNet of a two-stage pipeline bears limitations in being not self-contained (e.g. calls the external condition generation algorithms) with a large model redundancy (separately trained models for different types of conditioning inputs). Our proposed OmniControlNet consolidates 1) the condition generation (e.g., HED edges, depth maps, user scribble, and animal pose) by a single multi-tasking dense prediction algorithm under the task embedding guidance and 2) the image generation process for different conditioning types under the textual embedding guidance. OmniControlNet achieves significantly reduced model complexity and redundancy while capable of producing images of comparable quality for conditioned text-to-image generation.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses the algorithm and model redundancy in the existing ControlNet approach to conditional image generation. ControlNet is a two-stage pipeline:

1. **Condition Generation Stage**: External algorithms produce the image-level conditioning inputs (such as edges, depth maps, and user scribbles).
2. **Text-to-Image Generation Stage**: A separately trained image generation model is used for each type of conditioning input.

Despite ControlNet's significant success, this design has two main issues:

- In the condition generation stage, each type of image-level condition requires its own external algorithm, so the pipeline is not self-contained.
- In the text-to-image generation stage, an independent model must be trained for each type of conditioning input, leading to high model complexity and redundancy.

To address these issues, the paper proposes OmniControlNet, which reduces algorithm complexity and model redundancy through two integrations:

1. **Multi-task Dense Prediction Integration**: A single multi-task dense prediction algorithm, guided by a task embedding, performs edge detection, depth map generation, animal pose estimation, and scribble generation within one framework.
2. **Conditional Text-to-Image Generation Integration**: A single model handles the four types of conditioning inputs and generates images under textual embedding guidance (textual inversion).

With these changes, OmniControlNet generates images of comparable quality to existing methods while significantly reducing model parameters and memory usage.
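The consolidation idea can be illustrated with a small sketch. This is a toy under stated assumptions and not the paper's architecture: `ToyUnifiedPredictor`, `component_count`, the task identifiers, and the (scale, shift) "embedding" are all hypothetical names chosen for illustration. It shows only the structural point, namely one shared computation selected by a per-task embedding instead of one model per task, and the resulting component count when one extractor and one generator are counted per condition type.

```python
from typing import Dict, List, Tuple

Image = List[List[float]]

# Condition types named in the summary; the identifiers are illustrative.
TASKS = ("hed_edge", "depth", "scribble", "animal_pose")


class ToyUnifiedPredictor:
    """One shared 'backbone' reused by all tasks, plus a per-task embedding.

    Hypothetical stand-in: the embedding is just a (scale, shift) pair,
    not the embedding mechanism used in the paper.
    """

    def __init__(self, task_embeddings: Dict[str, Tuple[float, float]]):
        self.task_embeddings = task_embeddings

    def backbone(self, image: Image) -> Image:
        # Shared computation: absolute horizontal difference between neighbors.
        return [
            [abs(row[i + 1] - row[i]) for i in range(len(row) - 1)]
            for row in image
        ]

    def predict(self, image: Image, task: str) -> Image:
        # The task embedding modulates the shared features; no per-task model.
        if task not in self.task_embeddings:
            raise KeyError(f"unknown task: {task}")
        scale, shift = self.task_embeddings[task]
        return [[scale * v + shift for v in row] for row in self.backbone(image)]


def component_count(num_condition_types: int, unified: bool) -> int:
    """Condition extractors plus generation models needed for all types."""
    per_stage = 1 if unified else num_condition_types
    return 2 * per_stage


# Usage: one predictor object serves every task via its embedding.
embeddings = {task: (float(i + 1), 0.0) for i, task in enumerate(TASKS)}
model = ToyUnifiedPredictor(embeddings)
toy_image = [[0.0, 1.0, 3.0], [2.0, 2.0, 5.0]]
outputs = {task: model.predict(toy_image, task) for task in TASKS}
```

Under this counting, four condition types need eight components in the separate pipeline and two in the unified one. Actual parameter and memory savings depend on the real model sizes, which this sketch does not model.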