Localized Text-to-Image Generation for Free via Cross Attention Control

Yutong He, Ruslan Salakhutdinov, J. Zico Kolter
DOI: https://doi.org/10.48550/arXiv.2306.14636
2023-06-26
Computer Vision and Pattern Recognition
Abstract: Despite the tremendous success in text-to-image generative models, localized text-to-image generation (that is, generating objects or features at specific locations in an image while maintaining a consistent overall generation) still requires either explicit training or substantial additional inference time. In this work, we show that localized generation can be achieved by simply controlling cross attention maps during inference. With no additional training, model architecture modification or inference time, our proposed cross attention control (CAC) provides new open-vocabulary localization abilities to standard text-to-image models. CAC also enhances models that are already trained for localized generation when deployed at inference time. Furthermore, to assess localized text-to-image generation performance automatically, we develop a standardized suite of evaluations using large pretrained recognition models. Our experiments show that CAC improves localized generation performance with various types of location information ranging from bounding boxes to semantic segmentation maps, and enhances the compositional capability of state-of-the-art text-to-image generative models.
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses the difficulty text-to-image generative models have in placing objects or features at specific locations. Existing models (such as Stable Diffusion and DALL-E) can generate high-quality, diverse images from arbitrary text prompts, but they are controlled primarily through the prompt itself, which is insufficient for many applications. Users often want to supply location information (e.g., bounding boxes or semantic segmentation maps) that tells the model where specific elements should appear in the image. Current pretrained models are limited for localized generation in three main ways:

1. **Inability to directly handle location information**: Most existing models cannot take location information as input.
2. **Insufficient compositional ability**: The models perform poorly when combining multiple objects or features.
3. **Need for additional training or inference time**: Existing solutions often require retraining a model, fine-tuning an existing one, or combining multiple samples. These approaches usually demand large amounts of data, compute, and time, which makes them impractical for many applications.

To address these issues, the authors propose a new method, Cross Attention Control (CAC). CAC controls the cross-attention maps during inference, so that pretrained text-to-image generative models can perform localized generation without additional training, changes to the model architecture, or extra inference time. The authors also develop a standardized suite of evaluations, built on large pretrained recognition models, to assess localized text-to-image generation automatically.

### Main Contributions

1. **No additional training required**: CAC gives pretrained models localized generation capabilities without any further training.
2. **No modification to model architecture**: CAC operates on the existing model as is.
3. **No increase in inference time**: CAC does not add to the model's inference time.
4. **Supports open vocabulary**: CAC is not limited to a fixed vocabulary or a language parser and can handle arbitrary input text.
5. **Enhances existing models**: CAC applies to standard text-to-image generative models and also improves models that were already trained for localized generation.

### Experimental Results

The authors validate CAC through a range of experiments, including localized generation from bounding boxes and from semantic segmentation maps. The results show that CAC improves localized generation performance across these types of location information and strengthens compositional generation. In particular, CAC better preserves the correctness and consistency of objects when generating complex scenes that contain multiple objects and attributes.
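
To make the idea of controlling cross-attention maps at inference time more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the summary above does not give CAC's exact formulation, so the helper names (`box_to_mask`, `masked_cross_attention`), the choice to multiply the attention map by a binary region mask, and the decision not to renormalize afterwards are all hypothetical.

```python
import numpy as np

def box_to_mask(box, height, width):
    """Binary (height, width) mask that is 1 inside the box.

    box = (x0, y0, x1, y1), given as fractions of the image size in [0, 1].
    """
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=np.float32)
    r0, r1 = int(round(y0 * height)), int(round(y1 * height))
    c0, c1 = int(round(x0 * width)), int(round(x1 * width))
    mask[r0:r1, c0:c1] = 1.0
    return mask

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_cross_attention(q, k, v, token_masks):
    """Cross attention whose map is restricted by per-token region masks.

    q:           (P, d) image queries, one per spatial position (P = height * width)
    k, v:        (T, d) keys and values for the T text tokens
    token_masks: (P, T) with 1 where a token may influence a position, else 0
    Assumption: the mask is applied multiplicatively after the softmax,
    and the rows are not renormalized.
    """
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)  # (P, T) attention map
    attn = attn * token_masks                      # zero out disallowed regions
    return attn @ v, attn

# Example: a 4x4 latent grid, 3 text tokens; token 1 is confined to the left half.
h, w, t, d = 4, 4, 3, 8
rng = np.random.default_rng(0)
q = rng.normal(size=(h * w, d))
k = rng.normal(size=(t, d))
v = rng.normal(size=(t, d))

token_masks = np.ones((h * w, t), dtype=np.float32)  # unconstrained by default
token_masks[:, 1] = box_to_mask((0.0, 0.0, 0.5, 1.0), h, w).reshape(-1)

out, attn = masked_cross_attention(q, k, v, token_masks)
print(attn[:, 1].reshape(h, w))  # the right two columns are zero
```

Because the mask only rescales an attention map the model already computes, a scheme of this kind needs no retraining and no architectural change, which is in line with the claims summarized above. A segmentation map could be used in place of `box_to_mask` by supplying its per-class binary masks directly.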