OmniControlNet: Dual-stage Integration for Conditional Image Generation

Yilin Wang, Haiyang Xu, Xiang Zhang, Zeyuan Chen, Zhizhou Sha, Zirui Wang, Zhuowen Tu

2024-06-10

Abstract: We provide a two-way integration for the widely adopted ControlNet by integrating external condition generation algorithms into a single dense prediction method and incorporating its individually trained image generation processes into a single model. Despite its tremendous success, the ControlNet of a two-stage pipeline bears limitations in being not self-contained (e.g., calls the external condition generation algorithms) with a large model redundancy (separately trained models for different types of conditioning inputs). Our proposed OmniControlNet consolidates 1) the condition generation (e.g., HED edges, depth maps, user scribble, and animal pose) by a single multi-tasking dense prediction algorithm under the task embedding guidance and 2) the image generation process for different conditioning types under the textual embedding guidance. OmniControlNet achieves significantly reduced model complexity and redundancy while being capable of producing images of comparable quality for conditioned text-to-image generation.

Computer Vision and Pattern Recognition, Machine Learning

What problem does this paper attempt to address?

This paper addresses the algorithm and model redundancy of existing ControlNet models for conditional image generation. ControlNet is a two-stage pipeline:

1. **Condition Generation Stage**: External algorithms produce the image-level conditioning inputs (such as edges, depth maps, and user scribbles).
2. **Text-to-Image Generation Stage**: A separate diffusion model is trained for each type of conditioning input.

Despite its success, ControlNet has two main limitations:

- In the condition generation stage, each type of image-level condition requires its own external algorithm.
- In the text-to-image generation stage, an independent model must be trained for each type of conditioning input, which leads to high model complexity and redundancy.

To address these limitations, the paper proposes OmniControlNet, which reduces algorithm complexity and model redundancy through two integrations:

1. **Multi-task Dense Prediction Integration**: A single framework, guided by task embeddings, performs edge detection, depth map estimation, animal pose estimation, and scribble generation.
2. **Conditional Text-to-Image Generation Integration**: A single framework handles the four types of conditioning input, with generation guided by textual embeddings.

With these changes, OmniControlNet generates images of quality comparable to existing methods while significantly reducing model parameters and memory usage.
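The two integrations can be pictured with a small toy sketch. This is not the authors' implementation: it contains no neural network and no diffusion sampling, and every name in it (`ToyUnifiedPredictor`, `ToyUnifiedGenerator`, `run_pipeline`, the `<depth>`-style tokens) is a hypothetical stand-in chosen for illustration. It only shows the structural idea, assuming that one shared predictor is steered by a per-task embedding and that one shared generator is steered by a per-type textual token, so no per-condition model is needed.

```python
import random

# The four condition types named in the text.
CONDITION_TYPES = ("hed_edge", "depth", "scribble", "animal_pose")


class ToyUnifiedPredictor:
    """Toy stand-in for stage 1: one dense predictor shared by all tasks."""

    def __init__(self, dim=4, seed=0):
        rng = random.Random(seed)
        # Shared across every task (the bulk of the parameters in a real model).
        self.shared = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        # One small embedding per task; this is the only task-specific state.
        self.task_embedding = {
            task: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for task in CONDITION_TYPES
        }

    def predict(self, pixels, task):
        """Return a per-pixel 'condition map' steered by the task embedding."""
        emb = self.task_embedding[task]  # raises KeyError for unknown tasks
        gain = sum(s * e for s, e in zip(self.shared, emb))
        return [gain * p for p in pixels]


class ToyUnifiedGenerator:
    """Toy stand-in for stage 2: one generator shared by all condition types."""

    def __init__(self):
        # Hypothetical placeholder tokens standing in for learned textual embeddings.
        self.type_token = {task: f"<{task}>" for task in CONDITION_TYPES}

    def generate(self, prompt, condition_map, task):
        """Describe the request instead of sampling an image (no diffusion here)."""
        return {
            "prompt": f"{prompt} {self.type_token[task]}",
            "condition_size": len(condition_map),
            "task": task,
        }


def run_pipeline(pixels, prompt, task, predictor, generator):
    """Image -> condition map -> conditioned generation, with no per-task models."""
    condition_map = predictor.predict(pixels, task)
    return generator.generate(prompt, condition_map, task)


# A single predictor and a single generator serve all four condition types.
predictor = ToyUnifiedPredictor()
generator = ToyUnifiedGenerator()
results = {
    task: run_pipeline([0.0, 0.5, 1.0], "a red fox", task, predictor, generator)
    for task in CONDITION_TYPES
}
```

In the two-stage ControlNet setup described above, the equivalent code would need one external condition extractor and one trained generator per entry in `CONDITION_TYPES`; here the only per-task state is a small embedding and a token.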

Expression Conditional GAN for Facial Expression-to-Expression Translation

ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback

DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation

C3Net: Compound Conditioned ControlNet for Multimodal Content Generation

Statistics Enhancement Generative Adversarial Networks for Diverse Conditional Image Synthesis

OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction

OminiControl: Minimal and Universal Control for Diffusion Transformer

ECNet: Effective Controllable Text-to-Image Diffusion Models

Condition-Aware Neural Network for Controlled Image Generation

CCM: Real-Time Controllable Visual Content Creation Using Text-to-Image Consistency Models

UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild

CtrLoRA: An Extensible and Efficient Framework for Controllable Image Generation

Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation

OmniGen: Unified Image Generation

A one-to-many conditional generative adversarial network framework for multiple image-to-image translations

CCM: Adding Conditional Controls to Text-to-Image Consistency Models

ControlNeXt: Powerful and Efficient Control for Image and Video Generation

Adding Conditional Control to Text-to-Image Diffusion Models