Abstract: Natural language often struggles to accurately associate positional and attribute information with multiple instances, which limits current text-based visual generation models to simpler compositions featuring only a few dominant instances. To address this limitation, this work enhances diffusion models by introducing regional instance control, where each instance is governed by a bounding box paired with a free-form caption. Previous methods in this area typically rely on implicit position encoding or explicit attention masks to separate regions of interest (ROIs), resulting in either inaccurate coordinate injection or large computational overhead. Inspired by ROI-Align in object detection, we introduce a complementary operation called ROI-Unpool. Together, ROI-Align and ROI-Unpool enable explicit, efficient, and accurate ROI manipulation on high-resolution feature maps for visual generation. Building on ROI-Unpool, we propose ROICtrl, an adapter for pretrained diffusion models that enables precise regional instance control. ROICtrl is compatible with community-finetuned diffusion models, as well as with existing spatial-based add-ons (e.g., ControlNet, T2I-Adapter) and embedding-based add-ons (e.g., IP-Adapter, ED-LoRA), extending their applications to multi-instance generation. Experiments show that ROICtrl achieves superior performance in regional instance control while significantly reducing computational costs.

What problem does this paper attempt to address?

This paper addresses the difficulty natural language has in accurately associating position and attribute information with multiple instances in text-based visual generation. Current text-based visual generation models handle complex multi-instance compositions poorly, particularly when the prompt must specify the spatial position and attributes of each instance. To address this, the authors introduce **ROICtrl**, which strengthens regional instance control in diffusion models by combining the ROI-Align and ROI-Unpool operations.

### Problem Background

1. **Limitations of Natural Language**: Natural language is ambiguous when describing the positions and attributes of multiple instances, which limits the effectiveness of text-based visual generation models on complex compositions.
2. **Deficiencies of Existing Methods**:
   - **Implicit Position Encoding**: Methods such as GLIGEN rely on implicit position encoding, which leads to inaccurate coordinate injection.
   - **Explicit Attention Mask**: Methods such as MIGC and InstanceDiffusion use explicit attention masks. These improve spatial alignment, but at a high computational cost.

### Solution

To solve the above problems, the authors propose **ROICtrl**. Its main contributions are as follows:

1. **Introduction of the ROI-Unpool Operation**: Inspired by ROI-Align in object detection, the authors introduce ROI-Unpool, which restores cropped ROI features to their original positions and thereby achieves efficient and accurate ROI injection.
2. **Design of the ROICtrl Adapter**: ROICtrl is an adapter that can be integrated into pretrained diffusion models to achieve precise regional instance control. It is compatible with existing spatial control plugins (such as ControlNet) and embedding control plugins (such as IP-Adapter), extending these add-ons to multi-instance generation.
3. **Proposal of the ROICtrl-Bench Benchmark**: To evaluate instance control more comprehensively, the authors introduce ROICtrl-Bench, which covers both templated and free-form instance descriptions and provides broader evaluation criteria.

### Method Overview

1. **Problem Definition**: The multi-instance generation task describes the entire image with a global description $p_g$ together with $n$ instance descriptions $p_{ri}$ and their corresponding bounding box coordinates $c_{ri}$.
2. **ROI-Unpool Operation**: Unlike object detection, visual generation needs to "paste" the processed ROI features back to their original coordinates. ROI-Unpool does this without coordinate quantization errors, which improves spatial alignment.
3. **ROICtrl Adapter Design**: ROICtrl injects the global description and the instance descriptions in parallel and combines them through a learnable fusion mechanism. The specific steps include:
   - **Instance Description Injection**: Extract ROI features from the spatial features, then inject the instance descriptions through the pretrained cross-attention mechanism.
   - **Learnable Attention Fusion**: Dynamically weight and fuse the global attention output and the instance attention outputs to optimize the final feature representation.
4. **Training Objective**: ROICtrl adopts the standard diffusion loss and adds a regularization term that reduces the influence of the global attention output, so that the generated instances align better with their instance descriptions.

### Experimental Results

The experimental results show that ROICtrl outperforms existing methods on multiple benchmarks, with the clearest gains on small objects and free-form descriptions. ROICtrl also achieves better spatial alignment and regional text alignment.

In conclusion, by introducing the ROI-Unpool operation and the ROICtrl adapter, this paper addresses the limitations of text-based visual generation models on complex multi-instance compositions, improving both accuracy and efficiency.
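To make the crop-and-paste idea behind ROI-Align and ROI-Unpool more concrete, here is a minimal sketch in pure Python. It is not the authors' implementation: it assumes a single-channel feature map stored as a list of rows, one bilinear sample per output bin, and a paste step that fills each canvas pixel whose center lies inside the box. The function names `bilinear`, `roi_align`, and `roi_unpool` are illustrative choices, and the summary above does not specify how overlapping boxes or multiple channels are handled, so those cases are left out.

```python
import math


def bilinear(grid, y, x):
    """Sample a 2D grid (list of rows) at continuous (y, x), clamped to the border."""
    h, w = len(grid), len(grid[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = grid[y0][x0] * (1 - dx) + grid[y0][x1] * dx
    bottom = grid[y1][x0] * (1 - dx) + grid[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(fmap, box, out_size):
    """Crop a continuous box (x1, y1, x2, y2) into an out_size x out_size grid.

    Box coordinates are never rounded: each output bin is sampled at its
    center with bilinear interpolation (one sample per bin, an assumption).
    """
    x1, y1, x2, y2 = box
    bin_w = (x2 - x1) / out_size
    bin_h = (y2 - y1) / out_size
    # Pixel (r, c) has its center at (r + 0.5, c + 0.5), hence the -0.5 shift.
    return [
        [
            bilinear(fmap, y1 + (i + 0.5) * bin_h - 0.5, x1 + (j + 0.5) * bin_w - 0.5)
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]


def roi_unpool(roi, box, height, width):
    """Paste a square ROI grid back onto a height x width canvas.

    Pixels whose centers fall outside the box stay zero; pixels inside are
    bilinearly sampled from the ROI grid at the matching continuous location.
    """
    x1, y1, x2, y2 = box
    size = len(roi)
    canvas = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            cy, cx = r + 0.5, c + 0.5  # pixel center in feature-map coordinates
            if not (y1 <= cy <= y2 and x1 <= cx <= x2):
                continue
            v = (cy - y1) / (y2 - y1) * size - 0.5
            u = (cx - x1) / (x2 - x1) * size - 0.5
            canvas[r][c] = bilinear(roi, v, u)
    return canvas


# Usage: crop a box from an 8x8 map, then paste it back in place.
fmap = [[float(r * 10 + c) for c in range(8)] for r in range(8)]
box = (2.0, 2.0, 6.0, 6.0)
roi = roi_align(fmap, box, 4)
canvas = roi_unpool(roi, box, 8, 8)
```

In this sketch, an integer-aligned box whose output size matches its pixel extent round-trips exactly, while fractional boxes are interpolated rather than snapped to the pixel grid, which is the property the summary attributes to ROI-Unpool. In an adapter like the one described, the per-instance processing (for example, cross-attention with the instance caption) would happen between the two calls.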

ROICtrl: Boosting Instance Control for Visual Generation

InstanceDiffusion: Instance-level Control for Image Generation

UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild

Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models

ReCorD: Reasoning and Correcting Diffusion for HOI Generation

Multi-Region Text-Driven Manipulation of Diffusion Imagery

Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model

ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback

Readout Guidance: Learning Control from Diffusion Features

ControlNet-XS: Designing an Efficient and Effective Architecture for Controlling Text-to-Image Diffusion Models

CtrLoRA: An Extensible and Efficient Framework for Controllable Image Generation

X-Portrait: Expressive Portrait Animation with Hierarchical Motion Attention

Do We Need to Design Specific Diffusion Models for Different Tasks? Try ONE-PIC

Local Conditional Controlling for Text-to-Image Diffusion Models

Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation

MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis

FineControlNet: Fine-level Text Control for Image Generation with Spatially Aligned Text Control Injection

CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion

Generate Subgoal Images Before Act: Unlocking the Chain-of-Thought Reasoning in Diffusion Model for Robot Manipulation with Multimodal Prompts

Diffusion-VLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression