Compositional Text-to-Image Generation with Dense Blob Representations

Weili Nie, Sifei Liu, Morteza Mardani, Chao Liu, Benjamin Eckart, Arash Vahdat

2024-05-14

Abstract: Existing text-to-image models struggle to follow complex text prompts, raising the need for extra grounding inputs for better controllability. In this work, we propose to decompose a scene into visual primitives, denoted as dense blob representations, that contain fine-grained details of the scene while being modular, human-interpretable, and easy-to-construct. Based on blob representations, we develop a blob-grounded text-to-image diffusion model, termed BlobGEN, for compositional generation. Particularly, we introduce a new masked cross-attention module to disentangle the fusion between blob representations and visual features. To leverage the compositionality of large language models (LLMs), we introduce a new in-context learning approach to generate blob representations from text prompts. Our extensive experiments show that BlobGEN achieves superior zero-shot generation quality and better layout-guided controllability on MS-COCO. When augmented by LLMs, our method exhibits superior numerical and spatial correctness on compositional image generation benchmarks. Project page:

Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

This paper addresses the difficulty of generating images that faithfully follow complex text prompts. Existing text-to-image models often misread the context of a detailed description or ignore key words in it, so extra grounding inputs are needed for better controllability.

The paper proposes dense blob representations, which decompose a scene into visual primitives that carry fine-grained scene detail. Each blob consists of parameters specifying its position, size, and orientation, together with a text sentence describing the appearance and attributes of the object it covers. Because blobs are modular and human-interpretable, users can construct and manipulate them easily. On top of this representation, the paper builds a blob-grounded text-to-image diffusion model called BlobGEN. Its new masked cross-attention module disentangles the fusion of blob representations and visual features: each blob's representation is fused only with the visual features in the local region that the blob covers. The paper also introduces an in-context learning approach in which a large language model (LLM) generates blob representations from a text prompt, so that the compositional abilities of LLMs can be applied to complex compositional image generation.

Experiments show that BlobGEN achieves superior zero-shot generation quality and better layout-guided controllability on MS-COCO. When augmented by LLMs, it also shows superior numerical and spatial correctness on compositional image generation benchmarks. In summary, the main contributions are the blob representation, the BlobGEN model, and the LLM-based method for generating blobs from text prompts.
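To make the two central ideas concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a blob is parameterized as a tilted ellipse (the field names `cx`, `cy`, `a`, `b`, `theta` are hypothetical), that each blob's text has already been encoded into a single key/value vector, and that "attending to local regions" means a pixel may only draw on blobs whose ellipse covers it. The helper names `blob_mask` and `masked_cross_attention` are illustrative only.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Blob:
    """Hypothetical blob: ellipse parameters plus a free-form description."""
    cx: float     # center x, normalized to [0, 1]
    cy: float     # center y, normalized to [0, 1]
    a: float      # semi-axis length along the blob's own x direction
    b: float      # semi-axis length along the blob's own y direction
    theta: float  # orientation in radians
    text: str     # sentence describing appearance and attributes


def blob_mask(blob: Blob, h: int, w: int) -> np.ndarray:
    """Rasterize a blob into a boolean (h, w) mask of pixels inside its ellipse."""
    ys, xs = np.mgrid[0:h, 0:w]
    # Pixel centers in normalized coordinates, shifted to the blob center.
    x = (xs + 0.5) / w - blob.cx
    y = (ys + 0.5) / h - blob.cy
    # Rotate into the blob's own axes.
    c, s = np.cos(blob.theta), np.sin(blob.theta)
    u = c * x + s * y
    v = -s * x + c * y
    return (u / blob.a) ** 2 + (v / blob.b) ** 2 <= 1.0


def masked_cross_attention(queries, keys, values, masks):
    """Cross-attention restricted by blob masks.

    queries: (P, d), one per pixel; keys, values: (B, d), one per blob
    (an assumption of this sketch); masks: (B, P) boolean.
    Pixels covered by no blob receive a zero vector.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)            # (P, B)
    allowed = masks.T                                 # (P, B)
    scores = np.where(allowed, scores, -np.inf)       # block blobs outside the region
    covered = allowed.any(axis=1, keepdims=True)
    # Subtract the row maximum for numerical stability; use 0 for uncovered rows.
    row_max = np.where(covered, scores.max(axis=1, keepdims=True), 0.0)
    weights = np.exp(scores - row_max)                # blocked entries become 0
    denom = weights.sum(axis=1, keepdims=True)
    weights = np.divide(weights, denom, out=np.zeros_like(weights), where=denom > 0)
    return weights @ values                           # (P, d)


# Usage: two non-overlapping blobs on an 8x8 feature map with random features.
blobs = [
    Blob(0.25, 0.25, 0.2, 0.2, 0.0, "a small red ball"),
    Blob(0.75, 0.75, 0.2, 0.2, 0.0, "a blue wooden chair"),
]
H = W = 8
masks = np.stack([blob_mask(bl, H, W).reshape(-1) for bl in blobs])  # (2, 64)
rng = np.random.default_rng(0)
pixel_queries = rng.normal(size=(H * W, 4))
blob_keys = rng.normal(size=(len(blobs), 4))
blob_values = rng.normal(size=(len(blobs), 4))
fused = masked_cross_attention(pixel_queries, blob_keys, blob_values, masks)
```

In this sketch a pixel covered by exactly one blob receives that blob's value vector unchanged, while a pixel outside every blob receives zeros. This is the sense in which the mask keeps each blob's description from leaking into other regions.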
