Abstract: Despite the tremendous success in text-to-image generative models, localized text-to-image generation (that is, generating objects or features at specific locations in an image while maintaining a consistent overall generation) still requires either explicit training or substantial additional inference time. In this work, we show that localized generation can be achieved by simply controlling cross attention maps during inference. With no additional training, model architecture modification or inference time, our proposed cross attention control (CAC) provides new open-vocabulary localization abilities to standard text-to-image models. CAC also enhances models that are already trained for localized generation when deployed at inference time. Furthermore, to assess localized text-to-image generation performance automatically, we develop a standardized suite of evaluations using large pretrained recognition models. Our experiments show that CAC improves localized generation performance with various types of location information ranging from bounding boxes to semantic segmentation maps, and enhances the compositional capability of state-of-the-art text-to-image generative models.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address the challenges faced by text-to-image generative models in generating objects or features at specific locations. Although existing text-to-image generative models (such as Stable Diffusion and DALL-E) can generate high-quality and diverse images based on arbitrary text prompts, these models primarily rely on text prompts to control the generated content, which is insufficient for many application scenarios. Specifically, users often wish to provide positional information (e.g., bounding boxes or semantic segmentation maps) to guide the model in generating specific elements at specific locations in the image. However, current pre-trained models have limitations when performing local generation, mainly in the following aspects:

1. **Inability to directly handle positional information**: Most existing models cannot take positional information as input.
2. **Insufficient compositional ability**: The models perform poorly when combining multiple objects or features.
3. **Requires additional training or inference time**: Existing solutions often require retraining the model, fine-tuning existing models, or combining multiple samples, which usually demand a large amount of data, resources, and time, making them unsuitable for practical applications.

To address these issues, the authors propose a new method, Cross Attention Control (CAC). This method controls the cross-attention maps during inference, enabling pre-trained text-to-image generative models to achieve local generation without additional training, modifying the model architecture, or increasing inference time. Additionally, the authors developed a set of standardized evaluation metrics to automatically assess the performance of local text-to-image generation.

### Main Contributions

1. **No additional training required**: The CAC method enables pre-trained models to have local generation capabilities without additional training.
2. **No modification to model architecture**: No modifications to the model architecture are needed.
3. **No increase in inference time**: The method does not increase the model's inference time.
4. **Supports open vocabulary**: It is not limited to a fixed vocabulary or language parser and can handle various input texts.
5. **Enhances existing models**: It is applicable not only to standard text-to-image generative models but also enhances models already trained for local generation tasks.

### Experimental Results

The authors validated the effectiveness of the CAC method through various experiments, including local generation tasks using bounding boxes and semantic segmentation maps. The experimental results show that the CAC method significantly improves local generation performance under different types of positional information and performs well in compositional generation tasks. Specifically, the CAC method better maintains the correctness and consistency of objects when generating complex scenes containing multiple objects and attributes.
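
The summary above says CAC works by controlling cross-attention maps during inference, but it does not give the exact formulation. The following is a minimal, illustrative sketch of one plausible way to do this: text tokens that have a region are prevented from receiving attention from image locations outside that region. The function name `controlled_cross_attention`, the `token_regions` argument, and the choice to mask logits with negative infinity before the softmax are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax; entries equal to -inf get weight 0."""
    peak = max(logits)
    if peak == -math.inf:
        # Every token is masked out for this location: return all zeros.
        return [0.0] * len(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def controlled_cross_attention(queries, keys, token_regions):
    """Return attention weights indexed as [image_location][text_token].

    queries: one feature vector per image location (pixel or patch).
    keys: one feature vector per text token.
    token_regions: per token, either None (unconstrained) or a list of
        0/1 flags, one per image location, marking where the token's
        content is allowed to appear (hypothetical input format).
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    weights = []
    for location, query in enumerate(queries):
        logits = []
        for token, key in enumerate(keys):
            score = scale * sum(q * k for q, k in zip(query, key))
            region = token_regions[token]
            if region is not None and region[location] == 0:
                # Assumed control rule: block attention outside the region.
                score = -math.inf
            logits.append(score)
        weights.append(softmax(logits))
    return weights


# Toy example: a 2x2 image flattened to 4 locations, and 3 text tokens.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
keys = [[0.2, 0.2], [1.0, 0.0], [0.0, 1.0]]  # "a", "cat", "dog"
token_regions = [
    None,          # "a" is unconstrained
    [1, 0, 1, 0],  # "cat" is restricted to the left column
    [0, 1, 0, 1],  # "dog" is restricted to the right column
]
weights = controlled_cross_attention(queries, keys, token_regions)
```

In this toy setup, locations in the left column give zero attention to "dog" and locations in the right column give zero attention to "cat", while the unconstrained token still receives attention everywhere. Because the sketch only changes attention weights at inference, it needs no training or architectural change. A real diffusion model would apply a rule like this inside each cross-attention layer, with the region masks resized to that layer's resolution.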

Localized Text-to-Image Generation for Free via Cross Attention Control

Layout-to-Image Generation with Localized Descriptions using ControlNet with Cross-Attention Control

Unpaired Salient Object Translation Via Spatial Attention Prior

Locally controllable network based on visual–linguistic relation alignment for text-to-image generation

LFR-GAN: Local Feature Refinement based Generative Adversarial Network for Text-to-Image Generation

Local Conditional Controlling for Text-to-Image Diffusion Models

Textual Localization: Decomposing Multi-concept Images for Subject-Driven Text-to-Image Generation

From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models

Training-Free Location-Aware Text-to-Image Synthesis

TextCenGen: Attention-Guided Text-Centric Background Adaptation for Text-to-Image Generation

Text-to-image Generation Based on Spatial-Channel Attention and Semantic Redescription

R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation

Masked cross-attention and multi-head channel attention guiding single-stage generative adversarial networks for text-to-image generation

LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis

CAGAN: Text-To-Image Generation with Combined Attention GANs

Cross-View Image Translation Based on Local and Global Information Guidance

Image Caption with Global-Local Attention

LocTex: Learning Data-Efficient Visual Representations from Localized Textual Supervision

Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation

ReCo: Region-Controlled Text-to-Image Generation