GLIGEN: Open-Set Grounded Text-to-Image Generation

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, Yong Jae Lee
2023-04-17
Abstract: Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism. Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs, and the grounding ability generalizes well to novel spatial configurations and concepts. GLIGEN's zero-shot performance on COCO and LVIS outperforms that of existing supervised layout-to-image baselines by a large margin.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Graphics, Machine Learning
What problem does this paper attempt to address?
The paper addresses a limitation of existing large-scale text-to-image generation models: text input alone does not give precise control over the generation process, particularly over object localization and the use of reference images. To solve this, the paper proposes GLIGEN (Grounded-Language-to-Image Generation), an approach that extends existing pre-trained text-to-image diffusion models with new conditional input modalities (such as bounding boxes, key points, and edge maps). The core idea is to preserve the vast concept knowledge of the pre-trained model by freezing all of its original weights and injecting the grounding information through new trainable layers with a gated mechanism, thereby achieving open-world grounded text-to-image generation. Specifically, GLIGEN aims to:

1. **Enhance the controllability of the generation model**: Conditional inputs such as bounding boxes, key points, and style images make the generation process more controllable, allowing objects in the image to be precisely localized and described.
2. **Achieve open-world generation capabilities**: The grounding ability generalizes to novel spatial configurations and to concepts that were not seen in the grounding training data.
3. **Improve zero-shot performance**: Evaluated zero-shot on the COCO and LVIS datasets, GLIGEN outperforms existing supervised layout-to-image baselines by a large margin.

Through these improvements, GLIGEN enhances image generation quality and makes the generation process more flexible and controllable, which makes it more practical for real-world applications.
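The "frozen weights plus gated new layers" idea can be illustrated with a toy sketch. The text here does not spell out the form of the gate, so the code assumes a residual update scaled by `tanh(gamma)` with a learnable scalar `gamma` that starts at zero. The functions `frozen_layer` and `grounding_update` are hypothetical stand-ins, not GLIGEN's actual layers; the point is only that with the gate closed, the output is identical to that of the frozen pre-trained model.

```python
import math

def frozen_layer(tokens):
    # Hypothetical stand-in for a frozen pre-trained layer: a fixed scaling.
    return [[2.0 * x for x in tok] for tok in tokens]

def grounding_update(tokens, grounding):
    # Hypothetical stand-in for a new trainable layer: proposes the mean
    # grounding vector as an update for every visual token.
    n = len(grounding)
    dim = len(grounding[0])
    mean = [sum(g[i] for g in grounding) / n for i in range(dim)]
    return [list(mean) for _ in tokens]

def gated_layer(tokens, grounding, gamma):
    # Assumed gate: residual update scaled by tanh(gamma).
    # gamma = 0 closes the gate, so the frozen output passes through unchanged.
    hidden = frozen_layer(tokens)
    update = grounding_update(hidden, grounding)
    gate = math.tanh(gamma)
    return [[h + gate * u for h, u in zip(h_tok, u_tok)]
            for h_tok, u_tok in zip(hidden, update)]

tokens = [[1.0, 2.0]]
grounding = [[0.5, 0.5], [1.5, 0.5]]  # e.g. two encoded box tokens (made-up values)
closed = gated_layer(tokens, grounding, gamma=0.0)
opened = gated_layer(tokens, grounding, gamma=1.0)
```

Under this assumption, training can begin from the pre-trained model's behaviour and gradually open the gate as the new layers learn to use the grounding input.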