Abstract: To enhance the controllability of text-to-image diffusion models, current ControlNet-like models have explored various control signals to dictate image attributes. However, existing methods either handle conditions inefficiently or use a fixed number of conditions, which does not fully address the complexity of multiple conditions and their potential conflicts. This underscores the need for approaches that manage multiple conditions effectively for more reliable and detailed image synthesis. To address this issue, we propose a novel framework, DynamicControl, which supports dynamic combinations of diverse control signals, allowing adaptive selection of different numbers and types of conditions. Our approach begins with a double-cycle controller that generates an initial ranking of all input conditions from real scores, leveraging pre-trained conditional generation models and discriminative models. This controller evaluates the similarity between the extracted conditions and the input conditions, as well as the pixel-level similarity to the source image. We then integrate a Multimodal Large Language Model (MLLM) to build an efficient condition evaluator, which optimizes the ordering of conditions based on the double-cycle controller's score ranking. Our method jointly optimizes MLLMs and diffusion models, utilizing the reasoning capabilities of MLLMs to facilitate multi-condition text-to-image (T2I) tasks. The final sorted conditions are fed into a parallel multi-control adapter, which learns feature maps from the dynamic visual conditions and integrates them to modulate ControlNet, thereby enhancing control over the generated images. In both quantitative and qualitative comparisons, DynamicControl outperforms existing methods in controllability, generation quality, and composability under various conditional controls.
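
The score-based ranking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the two similarities are combined or how the number of conditions is chosen, so the weighted sum, the `weight` parameter, the `threshold` cutoff, and all names below are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class ConditionScore:
    """Hypothetical container for the two similarities named in the abstract."""
    name: str
    condition_similarity: float  # input condition vs. condition re-extracted from the generated image
    image_similarity: float      # pixel-level similarity between generated and source image


def combined_score(score: ConditionScore, weight: float = 0.5) -> float:
    # Assumption: a simple weighted sum; the abstract does not specify the combination rule.
    return weight * score.condition_similarity + (1.0 - weight) * score.image_similarity


def rank_conditions(scores, weight: float = 0.5):
    """Return the conditions sorted from highest to lowest combined score."""
    return sorted(scores, key=lambda s: combined_score(s, weight), reverse=True)


def select_conditions(scores, threshold: float, weight: float = 0.5):
    """Keep only the ranked conditions whose combined score reaches the threshold.

    Assumption: a fixed threshold stands in for the adaptive selection
    that the abstract attributes to the MLLM-based condition evaluator.
    """
    return [s for s in rank_conditions(scores, weight)
            if combined_score(s, weight) >= threshold]


if __name__ == "__main__":
    example = [
        ConditionScore("canny", 0.6, 0.7),
        ConditionScore("pose", 0.3, 0.2),
        ConditionScore("depth", 0.9, 0.8),
    ]
    for s in select_conditions(example, threshold=0.5):
        print(s.name, round(combined_score(s), 2))
```

In this toy setting the number of selected conditions varies with the scores rather than being fixed in advance, which is the behavior the abstract contrasts with fixed-condition methods.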

MVControl: Adding Conditional Control to Multi-view Diffusion for Controllable Text-to-3D Generation

Controllable Text-to-3D Generation via Surface-Aligned Gaussian Splatting

MVDream: Multi-view Diffusion for 3D Generation

Adding Conditional Control to Text-to-Image Diffusion Models

Control3Diff: Learning Controllable 3D Diffusion Models from Single-view Images

MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model

Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models

Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation

DreamComposer: Controllable 3D Object Generation via Multi-View Conditions

DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation

MVDiffusion: Enabling Holistic Multi-view Image Generation with Correspondence-Aware Diffusion

Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model

Animate3D: Animating Any 3D Model with Multi-view Video Diffusion

Volumetric Conditioning Module to Control Pretrained Diffusion Models for 3D Medical Images

Magic-Boost: Boost 3D Generation with Multi-View Conditioned Diffusion

Text-Image Conditioned Diffusion for Consistent Text-to-3D Generation

MVLight: Relightable Text-to-3D Generation via Light-conditioned Multi-View Diffusion