UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion

Wei Li, Xue Xu, Jiachen Liu, Xinyan Xiao

2024-06-06

Abstract: Existing text-to-image diffusion models primarily generate images from text prompts. However, the inherent conciseness of textual descriptions poses challenges in faithfully synthesizing images with intricate details, such as specific entities or scenes. This paper presents UNIMO-G, a simple multimodal conditional diffusion framework that operates on multimodal prompts with interleaved textual and visual inputs, which demonstrates a unified ability for both text-driven and subject-driven image generation. UNIMO-G comprises two core components: a Multimodal Large Language Model (MLLM) for encoding multimodal prompts, and a conditional denoising diffusion network for generating images based on the encoded multimodal input. We leverage a two-stage training strategy to effectively train the framework: first pre-training on large-scale text-image pairs to develop conditional image generation capabilities, and then instruction tuning with multimodal prompts to achieve unified image generation proficiency. A well-designed data processing pipeline involving language grounding and image segmentation is employed to construct multimodal prompts. UNIMO-G excels in both text-to-image generation and zero-shot subject-driven synthesis, and is notably effective in generating high-fidelity images from complex multimodal prompts involving multiple image entities.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses a limitation of existing text-to-image diffusion models: they generate images mainly from text prompts, and the conciseness of textual descriptions makes it difficult to faithfully synthesize images with intricate details, such as specific entities or scenes. To address this, the paper introduces UNIMO-G, a multimodal conditional diffusion framework that operates on prompts with interleaved textual and visual inputs and supports both text-driven and subject-driven image generation. UNIMO-G consists of two core components: a Multimodal Large Language Model (MLLM) that encodes multimodal prompts, and a conditional denoising diffusion network that generates images from the encoded multimodal input. Training follows a two-stage strategy: pre-training on large-scale text-image pairs to develop conditional image generation capability, followed by instruction tuning with multimodal prompts to achieve unified image generation. To construct the multimodal prompts, the paper uses a data processing pipeline involving language grounding and image segmentation. UNIMO-G performs well in text-to-image generation and zero-shot subject-driven synthesis, and is particularly effective at generating high-fidelity images from complex multimodal prompts involving multiple image entities. The reported experimental results show UNIMO-G outperforming existing models on multiple benchmarks, including single-entity and multi-entity subject-driven image generation.
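The two-component design can be illustrated with a small sketch. This is only a toy under stated assumptions, not the authors' implementation: the names `TextSegment`, `ImageSegment`, `toy_mllm_encode`, `toy_denoise_step`, and `toy_generate` are hypothetical, the "encoder" emits placeholder features instead of MLLM hidden states, and the "denoiser" is a simple numeric update standing in for a conditional denoising network. It shows only the data flow: an interleaved text-and-image prompt is encoded in order, and the encoding conditions every denoising step.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextSegment:
    text: str


@dataclass
class ImageSegment:
    # Hypothetical handle for a segmented entity image (e.g. a crop identifier).
    image_id: str


Segment = Union[TextSegment, ImageSegment]


def toy_mllm_encode(prompt: List[Segment], dim: int = 4) -> List[List[float]]:
    """Toy stand-in for the MLLM encoder: one vector per segment, order preserved."""
    encoded = []
    for seg in prompt:
        if isinstance(seg, TextSegment):
            flag, content = 0.0, seg.text
        else:
            flag, content = 1.0, seg.image_id
        # Placeholder features: modality flag, content length, zero padding.
        encoded.append([flag, float(len(content))] + [0.0] * (dim - 2))
    return encoded


def toy_denoise_step(latent: List[float],
                     condition: List[List[float]],
                     step_size: float = 0.1) -> List[float]:
    """Toy stand-in for one conditional denoising step.

    Moves the latent toward a target derived from the condition
    (here: the fraction of image segments in the prompt).
    """
    target = sum(vec[0] for vec in condition) / len(condition)
    return [x + step_size * (target - x) for x in latent]


def toy_generate(prompt: List[Segment], steps: int = 3,
                 latent_size: int = 2) -> List[float]:
    """Encode the prompt once, then run the conditioned denoising loop."""
    condition = toy_mllm_encode(prompt)
    latent = [1.0] * latent_size  # fixed start instead of sampled noise
    for _ in range(steps):
        latent = toy_denoise_step(latent, condition)
    return latent


# An interleaved prompt with two image entities.
prompt = [
    TextSegment("A photo of"),
    ImageSegment("dog_crop"),
    TextSegment("next to"),
    ImageSegment("cat_crop"),
]
result = toy_generate(prompt)
```

In a real system the condition would be a sequence of MLLM hidden states and the denoiser a trained network that attends to them; the sketch keeps only the interleaving and the encode-then-condition order.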
