Abstract: One critical prerequisite for faithful text-to-image generation is the accurate understanding of text inputs. Existing methods leverage the text encoder of the CLIP model to represent input prompts. However, the pre-trained CLIP model can only encode English text, with a maximum token length of 77. Moreover, the model capacity of the text encoder from CLIP is relatively limited compared to Large Language Models (LLMs), which offer multilingual input, accommodate longer context, and achieve superior text representation. In this paper, we investigate LLMs as the text encoder to improve the language understanding in text-to-image generation. Unfortunately, training a text-to-image generative model with LLMs from scratch demands significant computational resources and data. To this end, we introduce a three-stage training pipeline that effectively and efficiently integrates the existing text-to-image model with LLMs. Specifically, we propose a lightweight adapter that enables fast training of the text-to-image model using the textual representations from LLMs. Extensive experiments demonstrate that our model supports not only multilingual input but also longer input context, with superior image generation quality.

What problem does this paper attempt to address?

The paper aims to address several key issues in text-to-image generation:

1. **Multilingual Support**: Existing text-to-image generation models primarily rely on the CLIP model for text encoding, but the pre-trained CLIP text encoder only supports English input, which is inconvenient for non-English users.
2. **Text Representation Capability**: The capacity of the CLIP text encoder is relatively limited compared with LLMs, which restricts its text representation capability and, in turn, the quality of the generated images.
3. **Long Text Processing**: CLIP has a maximum token length of 77, so longer text descriptions cannot be encoded in full and contextual information is lost.

To address these issues, the paper proposes a new three-stage training process called OmniDiffusion. This method enhances text representation capability by introducing large language models (LLMs) and designs a lightweight adapter module to connect LLMs with existing diffusion models. The specific steps are as follows:

1. **Multilingual Text Alignment**: First, train the adapter module to align the text features of LLMs with the joint embedding space of the CLIP model, achieving alignment of multilingual text representations.
2. **End-to-End Text-to-Image Training**: Building on this alignment, further optimize the adapter module and the UNet in the diffusion model to improve the quality of the generated images.
3. **High Aesthetic Fine-Tuning**: Finally, fine-tune the model on a high-quality image dataset to enhance the visual aesthetics of the generated images.

Experimental results show that OmniDiffusion not only supports input in multiple languages but also generates high-quality images and performs well on multiple benchmarks.
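The first stage (aligning LLM text features with the CLIP embedding space through an adapter) can be illustrated with a toy sketch. This is not the paper's implementation: the summary does not state the adapter architecture or the alignment loss, so the linear adapter, the mean-squared-error objective, the feature dimensions, and all function names below are assumptions chosen for illustration. Random vectors stand in for the outputs of the frozen LLM and CLIP text encoders, and only the adapter weights are updated.

```python
import random

def make_linear_map(d_in, d_out, rng, scale):
    # Hypothetical linear adapter: a d_out x d_in weight matrix with random init.
    return [[rng.uniform(-scale, scale) for _ in range(d_in)] for _ in range(d_out)]

def apply_adapter(weights, x):
    # Project one LLM feature vector into the (assumed) CLIP embedding space.
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def alignment_epoch(weights, llm_feats, clip_feats, lr):
    """One SGD pass over (LLM feature, CLIP feature) pairs.

    Both encoders are treated as frozen, so only the adapter weights change.
    Returns the mean loss measured before each per-sample update.
    """
    total = 0.0
    for x, y in zip(llm_feats, clip_feats):
        pred = apply_adapter(weights, x)
        total += mse(pred, y)
        for i, row in enumerate(weights):
            # Gradient of the MSE with respect to row i of the weight matrix.
            grad_scale = 2.0 * (pred[i] - y[i]) / len(y)
            for j in range(len(row)):
                row[j] -= lr * grad_scale * x[j]
    return total / len(llm_feats)

rng = random.Random(0)
D_LLM, D_CLIP, N = 6, 4, 32  # toy sizes, far smaller than real encoders

# Synthetic stand-ins: "LLM features" and "CLIP features" of the same prompts,
# related by a hidden linear map that the adapter has to recover.
hidden_map = make_linear_map(D_LLM, D_CLIP, rng, scale=1.0)
llm_feats = [[rng.uniform(-1.0, 1.0) for _ in range(D_LLM)] for _ in range(N)]
clip_feats = [apply_adapter(hidden_map, x) for x in llm_feats]

adapter = make_linear_map(D_LLM, D_CLIP, rng, scale=0.1)
losses = [alignment_epoch(adapter, llm_feats, clip_feats, lr=0.1) for _ in range(200)]

print(f"alignment loss: first epoch {losses[0]:.4f}, last epoch {losses[-1]:.2e}")
```

In this toy setting the alignment loss falls as the adapter learns the mapping. In the pipeline described above, the second stage would then train the adapter together with the UNet on the text-to-image objective, which this sketch does not cover.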

An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation

LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation

LLMGA: Multimodal Large Language Model based Generation Assistant

LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation

Elucidating the design space of language models for image generation

LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation

Generating Images with Multimodal Language Models

A Comprehensive Evaluation of Constrained Text Generation for Large Language Models

Unveiling Large Language Models Generated Texts: A Multi-Level Fine-Grained Detection Framework

Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding

Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks

From Text to Pixel: Advancing Long-Context Understanding in MLLMs

EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model

Unified Generative and Discriminative Training for Multi-modal Large Language Models

Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation

Unified Text-to-Image Generation and Retrieval

StoryGPT-V: Large Language Models as Consistent Story Visualizers

Liquid: Language Models are Scalable Multi-modal Generators

A Framework for Image Text Retrieval Based on Large Language Model

InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition