Compositional Text-to-Image Generation with Dense Blob Representations

Weili Nie, Sifei Liu, Morteza Mardani, Chao Liu, Benjamin Eckart, Arash Vahdat
2024-05-14
Abstract: Existing text-to-image models struggle to follow complex text prompts, raising the need for extra grounding inputs for better controllability. In this work, we propose to decompose a scene into visual primitives - denoted as dense blob representations - that contain fine-grained details of the scene while being modular, human-interpretable, and easy-to-construct. Based on blob representations, we develop a blob-grounded text-to-image diffusion model, termed BlobGEN, for compositional generation. Particularly, we introduce a new masked cross-attention module to disentangle the fusion between blob representations and visual features. To leverage the compositionality of large language models (LLMs), we introduce a new in-context learning approach to generate blob representations from text prompts. Our extensive experiments show that BlobGEN achieves superior zero-shot generation quality and better layout-guided controllability on MS-COCO. When augmented by LLMs, our method exhibits superior numerical and spatial correctness on compositional image generation benchmarks. Project page:
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses the difficulty of generating images from complex text prompts. Existing text-to-image models often fail to follow such prompts accurately, misreading context or ignoring key words. To address this, the paper decomposes a scene into visual primitives called dense blob representations, which carry detailed scene information. Each blob representation consists of position, size, and orientation parameters, together with a text description of the object's appearance and attributes. This gives finer control over image generation, and because the representation is simple, users can construct and manipulate blobs easily.

Building on this representation, the paper develops a blob-grounded text-to-image diffusion model called BlobGEN. It uses a new masked cross-attention module that decouples the fusion of blob representations and visual features: each blob is attended to only by the local region it covers. The paper also introduces an in-context learning approach in which large language models (LLMs) generate blob representations from text prompts, exploiting the compositional and visual understanding capabilities of LLMs to handle complex compositional image generation tasks.

Experimental results show that BlobGEN achieves superior zero-shot generation quality and better layout-guided controllability on the MS-COCO dataset. When combined with LLMs, BlobGEN also shows superior numerical and spatial correctness on compositional image generation benchmarks. In summary, the main contributions are the blob representation itself, the BlobGEN model, and the LLM-based method for generating blobs from text, which together improve the accuracy and controllability of text-to-image generation.
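To make the blob representation and the masking idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes that position, size, and orientation are encoded as a tilted ellipse (center, two semi-axes, an angle) in normalized image coordinates, and it replaces the learned projections of a real diffusion model with plain dot-product attention. The names `Blob`, `blob_mask`, and `masked_cross_attention` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Blob:
    """Hypothetical container for one blob: ellipse parameters plus a caption."""
    cx: float          # center x, normalized to [0, 1] (assumption)
    cy: float          # center y, normalized to [0, 1] (assumption)
    a: float           # semi-axis along the blob's own first axis
    b: float           # semi-axis along the blob's own second axis
    theta: float       # orientation in radians
    description: str   # free-form text describing appearance and attributes


def blob_mask(blob: Blob, h: int, w: int) -> np.ndarray:
    """Rasterize the tilted ellipse into a boolean mask of shape (h, w)."""
    ys, xs = np.mgrid[0:h, 0:w]
    # Pixel centers in normalized coordinates, relative to the blob center.
    x = (xs + 0.5) / w - blob.cx
    y = (ys + 0.5) / h - blob.cy
    # Rotate into the ellipse's own frame.
    c, s = np.cos(blob.theta), np.sin(blob.theta)
    u = c * x + s * y
    v = -s * x + c * y
    return (u / blob.a) ** 2 + (v / blob.b) ** 2 <= 1.0


def masked_cross_attention(queries, keys, values, masks):
    """Cross-attention where each pixel attends only to blobs that cover it.

    queries: (h*w, d) visual features
    keys:    (n_blobs, d) blob text embeddings
    values:  (n_blobs, d_v) blob text embeddings
    masks:   (n_blobs, h*w) boolean blob masks
    Pixels covered by no blob receive a zero update.
    """
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])  # (h*w, n_blobs)
    allowed = masks.T                                        # (h*w, n_blobs)
    scores = np.where(allowed, scores, -np.inf)
    covered = allowed.any(axis=1)
    scores[~covered] = 0.0  # placeholder so the softmax stays finite
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    out = weights @ values
    out[~covered] = 0.0
    return out


# Two non-overlapping blobs on an 8x8 feature map.
blobs = [
    Blob(0.25, 0.5, 0.2, 0.3, 0.0, "a red apple on the left"),
    Blob(0.75, 0.5, 0.2, 0.3, 0.0, "a green pear on the right"),
]
masks = np.stack([blob_mask(b, 8, 8).ravel() for b in blobs])
```

In this sketch the text of one blob cannot influence pixels outside its ellipse, which is the sense in which masking decouples the blobs from one another. How the model embeds each description and where the module sits in the network are not specified in the summary above, so they are left out here.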