Abstract: Diffusion models have exhibited impressive prowess in the text-to-image task. Recent methods add image-level structure controls, e.g., edge and depth maps, to manipulate the generation process together with text prompts to obtain desired images. This controlling process operates globally on the entire image, which limits the flexibility of control regions. In this paper, we explore a novel and practical task setting: local control. It focuses on controlling specific local regions according to user-defined image conditions, while the remaining regions are conditioned only by the original text prompt. However, achieving local conditional control is non-trivial. Naively adding local conditions directly may lead to the local control dominance problem, which forces the model to focus on the controlled region and neglect object generation in other regions. To mitigate this problem, we propose Regional Discriminate Loss to update the noised latents, aiming at enhanced object generation in non-control regions. Furthermore, the proposed Focused Token Response suppresses weaker attention scores that lack the strongest response, to enhance object distinction and reduce duplication. Lastly, we adopt Feature Mask Constraint to reduce quality degradation in images caused by information differences across the local control region. All proposed strategies operate at the inference stage. Extensive experiments demonstrate that our method can synthesize high-quality images aligned with the text prompt under local control conditions.
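To make the idea of restricting a control signal to a local region concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not specify how Feature Mask Constraint is computed, so the element-wise masking below, the single-channel 2D feature map, and the function names `mask_control_features` and `add_control` are all assumptions chosen for illustration.

```python
def mask_control_features(features, mask):
    """Zero out control features outside the user-defined local region.

    features: H x W list of floats (hypothetical single-channel control feature map).
    mask:     H x W list of 0/1 values, where 1 marks the controlled region.
    Assumption: the constraint is modeled as a plain element-wise product.
    """
    same_rows = len(features) == len(mask)
    if not same_rows or any(len(f) != len(m) for f, m in zip(features, mask)):
        raise ValueError("features and mask must have the same shape")
    return [[f * m for f, m in zip(f_row, m_row)]
            for f_row, m_row in zip(features, mask)]


def add_control(base, features, mask):
    """Add the masked control features to the base features.

    Positions outside the mask keep their base (text-only) values unchanged.
    """
    masked = mask_control_features(features, mask)
    return [[b + c for b, c in zip(b_row, c_row)]
            for b_row, c_row in zip(base, masked)]
```

In this sketch, positions where the mask is 0 receive no control contribution, so they remain driven by the text prompt alone, which mirrors the task setting described in the abstract.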

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

The paper attempts to address the issue of achieving fine control over specific local regions (local control) in text-to-image generation tasks. Traditional global control methods, while capable of generating images similar to structural conditions, struggle to produce images that align with text prompts. Even with the addition of control masks, they can only generate concepts closest to local conditions, failing to accurately generate multiple objects or maintain the quality of other regions. Specifically, the paper raises the following issues:

1. **Local Control Requirement**: Users wish to have fine control over specific local regions of the image, while the remaining regions are generated based on the original text prompt.
2. **Local Control Dominance Issue**: Directly adding local conditions may cause the model to overly focus on the controlled regions, neglecting the generation of objects in other areas.
3. **Decline in Generation Quality**: Under local control conditions, the information disparity between different regions may lead to a decline in the quality of the generated image.

To address these issues, the paper proposes a new local control method by introducing three techniques to enhance object generation, reduce repetition, and improve image quality. These techniques include:

- **Regional Discriminate Loss**: Updating latent variables to enhance object generation in non-controlled regions.
- **Focused Token Response**: Suppressing weaker attention scores to enhance object distinction.
- **Feature Mask Constraint**: Reducing the decline in image quality caused by differences in control information.

Through these methods, the paper aims to achieve high-precision local control while maintaining the overall quality of the generated image and its consistency with the text prompt.
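The Focused Token Response idea summarized above (suppressing attention scores that are not the strongest response) can be illustrated with a small sketch. The summary does not give the exact rule, so this version makes several assumptions: attention is a list of rows, one per spatial position, with one score per object token; "strongest" means the per-position maximum; and weaker scores are scaled by a hypothetical `suppress` factor (0.0 by default). The function name `focused_token_response` is likewise illustrative.

```python
def focused_token_response(attn, suppress=0.0):
    """Keep each position's strongest token score and damp the others.

    attn:     list of rows, one per spatial position; each row holds one
              attention score per object token (hypothetical layout).
    suppress: factor applied to every non-maximal score (assumption; 0.0
              removes the weaker responses entirely).
    """
    focused = []
    for row in attn:
        if not row:
            # No tokens at this position: nothing to suppress.
            focused.append([])
            continue
        # Index of the token with the highest score; ties go to the first one.
        strongest = max(range(len(row)), key=row.__getitem__)
        focused.append([score if j == strongest else score * suppress
                        for j, score in enumerate(row)])
    return focused
```

With two object tokens, a position that responds to both ends up assigned to only one of them, which is one plausible way to reduce the object duplication the summary mentions.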

Local Conditional Controlling for Text-to-Image Diffusion Models
