UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion

Wei Li, Xue Xu, Jiachen Liu, Xinyan Xiao
2024-06-06
Abstract: Existing text-to-image diffusion models primarily generate images from text prompts. However, the inherent conciseness of textual descriptions poses challenges in faithfully synthesizing images with intricate details, such as specific entities or scenes. This paper presents UNIMO-G, a simple multimodal conditional diffusion framework that operates on multimodal prompts with interleaved textual and visual inputs, which demonstrates a unified ability for both text-driven and subject-driven image generation. UNIMO-G comprises two core components: a Multimodal Large Language Model (MLLM) for encoding multimodal prompts, and a conditional denoising diffusion network for generating images based on the encoded multimodal input. We leverage a two-stage training strategy to effectively train the framework: first pre-training on large-scale text-image pairs to develop conditional image generation capabilities, and then instruction tuning with multimodal prompts to achieve unified image generation proficiency. A well-designed data processing pipeline involving language grounding and image segmentation is employed to construct multimodal prompts. UNIMO-G excels in both text-to-image generation and zero-shot subject-driven synthesis, and is notably effective in generating high-fidelity images from complex multimodal prompts involving multiple image entities.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses a limitation of existing text-to-image diffusion models: they generate images primarily from text prompts, and because textual descriptions are inherently concise, they struggle to faithfully synthesize images with intricate details such as specific entities or scenes. To address this, the paper introduces UNIMO-G, a multimodal conditional diffusion framework that operates on prompts with interleaved textual and visual inputs and supports both text-driven and subject-driven image generation. UNIMO-G consists of two core components: a Multimodal Large Language Model (MLLM) that encodes the multimodal prompt, and a conditional denoising diffusion network that generates images from the encoded input. Training proceeds in two stages: pre-training on large-scale text-image pairs to develop conditional image generation capability, followed by instruction tuning with multimodal prompts to achieve unified image generation. The multimodal prompts are constructed with a data processing pipeline that combines language grounding and image segmentation. UNIMO-G performs well in both text-to-image generation and zero-shot subject-driven synthesis, and is particularly effective at generating high-fidelity images from complex multimodal prompts involving multiple image entities. According to the reported experiments, UNIMO-G outperforms existing models on multiple benchmarks, including single-entity and multi-entity subject-driven image generation, which indicates that it handles multimodal instructions well.
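To make the idea of an interleaved multimodal prompt concrete, the following is a minimal sketch of how a caption and its grounded entities might be turned into a sequence of text and image segments. It is not the authors' pipeline: the names `TextSegment`, `ImageSegment`, `GroundedEntity`, and `build_interleaved_prompt` are hypothetical, the character spans stand in for the output of a language grounding step, and `image_ref` stands in for a crop produced by image segmentation. Whether the entity phrase is kept alongside its image is also an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextSegment:
    text: str


@dataclass
class ImageSegment:
    entity: str     # the caption phrase this crop depicts
    image_ref: str  # hypothetical handle to a segmented image crop


@dataclass
class GroundedEntity:
    start: int      # character offset where the phrase starts in the caption
    end: int        # character offset just past the end of the phrase
    image_ref: str  # hypothetical handle to the crop for this phrase


Segment = Union[TextSegment, ImageSegment]


def build_interleaved_prompt(
    caption: str, entities: List[GroundedEntity]
) -> List[Segment]:
    """Split a caption into alternating text and image segments.

    Each grounded span becomes an ImageSegment; the text between spans
    is kept as TextSegment entries, preserving the original order.
    """
    segments: List[Segment] = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e.start):
        if ent.start < cursor or ent.end > len(caption) or ent.start >= ent.end:
            raise ValueError(
                "entity spans must be non-empty, in range, and non-overlapping"
            )
        if ent.start > cursor:
            segments.append(TextSegment(caption[cursor:ent.start]))
        segments.append(ImageSegment(caption[ent.start:ent.end], ent.image_ref))
        cursor = ent.end
    if cursor < len(caption):
        segments.append(TextSegment(caption[cursor:]))
    return segments


if __name__ == "__main__":
    caption = "a dog next to a red car"
    entities = [
        GroundedEntity(14, 23, "car_crop.png"),
        GroundedEntity(0, 5, "dog_crop.png"),
    ]
    for segment in build_interleaved_prompt(caption, entities):
        print(segment)
```

In a full system, a sequence like this would be encoded by the MLLM, and the resulting representation would condition the denoising network. The sketch covers only the prompt construction step.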