Local Conditional Controlling for Text-to-Image Diffusion Models

Yibo Zhao, Liang Peng, Yang Yang, Zekai Luo, Hengjia Li, Yao Chen, Zheng Yang, Xiaofei He, Wei Zhao, Qinglin Lu, Boxi Wu, Wei Liu
2024-08-22
Abstract: Diffusion models have exhibited impressive prowess in the text-to-image task. Recent methods add image-level structure controls, e.g., edge and depth maps, to manipulate the generation process together with text prompts to obtain desired images. This controlling process operates globally on the entire image, which limits the flexibility of control regions. In this paper, we explore a novel and practical task setting: local control. It focuses on controlling a specific local region according to user-defined image conditions, while the remaining regions are conditioned only by the original text prompt. However, it is non-trivial to achieve local conditional controlling. The naive manner of directly adding local conditions may lead to the local control dominance problem, which forces the model to focus on the controlled region and neglect object generation in other regions. To mitigate this problem, we propose Regional Discriminate Loss to update the noised latents, aiming at enhanced object generation in non-control regions. Furthermore, the proposed Focused Token Response suppresses weaker attention scores which lack the strongest response to enhance object distinction and reduce duplication. Lastly, we adopt Feature Mask Constraint to reduce quality degradation in images caused by information differences across the local control region. All proposed strategies operate at the inference stage. Extensive experiments demonstrate that our method can synthesize high-quality images aligned with the text prompt under local control conditions.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

The paper addresses fine-grained control over specific local regions (local control) in text-to-image generation. Existing global control methods can generate images that follow the structural conditions, but they struggle to produce images that align with the text prompt. Even with the addition of control masks, they can only generate the concepts closest to the local conditions, failing to generate multiple objects accurately or to maintain the quality of other regions.

Specifically, the paper raises the following issues:

1. **Local Control Requirement**: Users wish to have fine control over specific local regions of the image, while the remaining regions are generated from the original text prompt alone.
2. **Local Control Dominance Issue**: Directly adding local conditions may cause the model to focus too heavily on the controlled regions and neglect object generation in other areas.
3. **Decline in Generation Quality**: Under local control conditions, the information disparity between different regions may degrade the quality of the generated image.

To address these issues, the paper proposes a local control method built on three techniques, all applied at the inference stage, that enhance object generation, reduce duplication, and improve image quality:

- **Regional Discriminate Loss**: Updates the noised latents to enhance object generation in non-controlled regions.
- **Focused Token Response**: Suppresses weaker attention scores that lack the strongest response, to enhance object distinction and reduce duplication.
- **Feature Mask Constraint**: Reduces the decline in image quality caused by information differences across the local control region.

Through these methods, the paper aims to achieve precise local control while maintaining the overall quality of the generated image and its consistency with the text prompt.
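To make two of the techniques above more concrete, here is a minimal NumPy sketch. It is one possible reading of the abstract's descriptions, not the authors' implementation. The function names `focused_token_response` and `apply_feature_mask`, the array shapes, and the `suppress` factor are all hypothetical. The sketch assumes that Focused Token Response keeps, at each spatial position, only the object token with the strongest cross-attention score, and that Feature Mask Constraint restricts control features to the user-defined local region.

```python
import numpy as np


def focused_token_response(attn, object_token_ids, suppress=0.0):
    """Keep only the strongest object token at each spatial position.

    attn: array of shape (num_positions, num_tokens) with cross-attention scores.
    object_token_ids: indices of the prompt tokens that name objects (assumed given).
    suppress: factor applied to the non-winning object tokens (hypothetical knob).
    """
    attn = np.asarray(attn, dtype=float)
    ids = np.asarray(object_token_ids)
    out = attn.copy()

    sub = attn[:, ids]                        # scores of the object tokens only
    winner = sub.argmax(axis=1)               # strongest object token per position
    keep = np.zeros(sub.shape, dtype=bool)
    keep[np.arange(sub.shape[0]), winner] = True

    # Winners keep their score; weaker object tokens are scaled down.
    out[:, ids] = np.where(keep, sub, sub * suppress)
    return out


def apply_feature_mask(control_features, mask):
    """Zero out control features outside the local control region.

    control_features: array of shape (channels, height, width).
    mask: binary array of shape (height, width), 1 inside the controlled region.
    """
    control_features = np.asarray(control_features, dtype=float)
    mask = np.asarray(mask, dtype=float)
    return control_features * mask[None, :, :]


if __name__ == "__main__":
    # Two spatial positions, three tokens; tokens 1 and 2 are object tokens.
    attn = np.array([[0.1, 0.6, 0.3],
                     [0.2, 0.3, 0.5]])
    print(focused_token_response(attn, [1, 2]))

    # One-channel 2x2 feature map, control applied to the left column only.
    feats = np.ones((1, 2, 2))
    mask = np.array([[1, 0],
                     [1, 0]])
    print(apply_feature_mask(feats, mask))
```

In a real diffusion pipeline these operations would act on the model's cross-attention maps and on the control branch's feature tensors at each denoising step. The Regional Discriminate Loss is left out here because the text above gives no formula for it.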