Abstract: Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism. Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs, and the grounding ability generalizes well to novel spatial configurations and concepts. GLIGEN's zero-shot performance on COCO and LVIS outperforms that of existing supervised layout-to-image baselines by a large margin.

What problem does this paper attempt to address?

The paper addresses a limitation of existing large-scale text-to-image models: text input alone does not give precise control over generation, particularly over where objects are placed and over the use of reference images. To solve this, it proposes GLIGEN (Grounded-Language-to-Image Generation), which extends existing pre-trained text-to-image diffusion models with additional conditional input modalities (such as bounding boxes, keypoints, and edge maps). The core idea is to freeze the original model weights, which preserves the pre-trained model's broad concept knowledge, and to inject the grounding information into new trainable layers through a gated mechanism. This yields open-world grounded text-to-image generation.

Specifically, GLIGEN aims to:

1. **Enhance the controllability of the generation model**: Conditional inputs (such as bounding boxes, keypoints, and style images) make it possible to specify where objects appear in the image and what they are.
2. **Achieve open-world generation capabilities**: The grounding ability generalizes to novel spatial configurations and to concepts beyond those seen in the grounding training data.
3. **Improve zero-shot performance**: Evaluated zero-shot on COCO and LVIS, GLIGEN outperforms existing supervised layout-to-image baselines by a large margin.

Together, these changes make the generation process more flexible and controllable, and therefore more practical for real-world applications.
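The gated injection into new layers can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the paper's implementation: the function names are hypothetical, the attention uses identity projections and a single head, and the gate is taken to be `tanh(gamma)` with `gamma` starting at zero so that the new layer initially leaves the frozen model's features unchanged.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    # Single-head self-attention with identity projections (illustrative only).
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out

def gated_grounding_layer(visual_tokens, grounding_tokens, gamma):
    # Hypothetical new trainable layer: attend over visual and grounding tokens
    # together, keep only the visual positions, and add them back through a gate.
    attended = self_attention(visual_tokens + grounding_tokens)
    gate = math.tanh(gamma)  # assumed gate form; gamma = 0 closes the gate
    return [
        [v_d + gate * a_d for v_d, a_d in zip(v, a)]
        for v, a in zip(visual_tokens, attended[:len(visual_tokens)])
    ]

visual = [[1.0, 0.0], [0.0, 1.0]]   # stand-in for frozen-model features
grounding = [[0.5, 0.5]]            # stand-in for an encoded box + phrase
closed = gated_grounding_layer(visual, grounding, gamma=0.0)
opened = gated_grounding_layer(visual, grounding, gamma=1.0)
```

With the gate closed (`gamma=0.0`) the output equals the input features, which is one way a frozen pre-trained model can keep its behavior at the start of training; as `gamma` moves away from zero, the grounding tokens begin to influence the visual features.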

GLIGEN: Open-Set Grounded Text-to-Image Generation
