Dense Text-to-Image Generation with Attention Modulation

Yunji Kim, Jiyoung Lee, Jin-Hwa Kim, Jung-Woo Ha, Jun-Yan Zhu
2023-08-25
Abstract: Existing text-to-image diffusion models struggle to synthesize realistic images given dense captions, where each text prompt provides a detailed description for a specific image region. To address this, we propose DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle such dense captions while offering control over the scene layout. We first analyze the relationship between generated images' layouts and the pre-trained model's intermediate attention maps. Next, we develop an attention modulation method that guides objects to appear in specific regions according to layout guidance. Without requiring additional fine-tuning or datasets, we improve image generation performance given dense captions regarding both automatic and human evaluation scores. In addition, we achieve similar-quality visual results with models specifically trained with layout conditions.
Computer Vision and Pattern Recognition, Graphics, Machine Learning
What problem does this paper attempt to address?
This paper addresses a weakness of existing text-to-image diffusion models: they struggle to generate realistic images from dense captions, where each text prompt gives a detailed description of a specific image region. The authors propose DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle dense captions while giving the user control over the scene layout. They first analyze how the layout of a generated image relates to the pre-trained model's intermediate attention maps, then use that analysis to build an attention modulation method that guides objects to appear in the regions specified by the layout guidance. Without additional fine-tuning or datasets, the method improves image generation from dense captions on both automatic and human evaluation scores. Its visual results are also similar in quality to those of models trained specifically with layout conditions.
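The general idea of attention modulation can be illustrated with a small sketch. It is not the authors' formulation, which this summary does not spell out. It assumes one simple form of modulation: a constant bias is added to the attention score of a (pixel, token) pair when the token describes the region containing that pixel, and subtracted otherwise, before the softmax. The function names, the `region_mask` layout, and the `strength` parameter are all hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def modulated_attention_weights(queries, keys, region_mask, strength=1.0):
    """Attention weights with a layout-dependent bias (illustrative only).

    queries:     one feature vector per image location (pixel).
    keys:        one feature vector per text token.
    region_mask: region_mask[i][j] is True when token j describes the
                 region that contains pixel i (assumed input format).
    strength:    size of the bias added or subtracted (assumed knob).
    """
    d = len(keys[0])
    weights = []
    for q, mask_row in zip(queries, region_mask):
        logits = []
        for k, in_region in zip(keys, mask_row):
            # Standard scaled dot-product attention score.
            score = sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            # Raise the score for matching pairs, lower it for the rest.
            score += strength if in_region else -strength
            logits.append(score)
        weights.append(softmax(logits))
    return weights


# Two pixels, two tokens: pixel 0 belongs to token 0's region,
# pixel 1 belongs to token 1's region.
queries = [[1.0, 0.0], [1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
region_mask = [[True, False], [False, True]]
weights = modulated_attention_weights(queries, keys, region_mask, strength=1.0)
```

In this toy example the unmodulated scores would make pixel 1 attend mostly to token 0; with the bias applied, it attends mostly to token 1, the token assigned to its region. Setting `strength=0` recovers ordinary attention weights.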