An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation

Zhiyu Tan, Mengping Yang, Luozheng Qin, Hao Yang, Ye Qian, Qiang Zhou, Cheng Zhang, Hao Li
2024-07-18
Abstract: One critical prerequisite for faithful text-to-image generation is the accurate understanding of text inputs. Existing methods leverage the text encoder of the CLIP model to represent input prompts. However, the pre-trained CLIP model can only encode English, with a maximum token length of 77. Moreover, the model capacity of the CLIP text encoder is relatively limited compared to Large Language Models (LLMs), which offer multilingual input, accommodate longer context, and achieve superior text representation. In this paper, we investigate LLMs as the text encoder to improve language understanding in text-to-image generation. Unfortunately, training a text-to-image generative model with LLMs from scratch demands significant computational resources and data. To this end, we introduce a three-stage training pipeline that effectively and efficiently integrates an existing text-to-image model with LLMs. Specifically, we propose a lightweight adapter that enables fast training of the text-to-image model using the textual representations from LLMs. Extensive experiments demonstrate that our model supports not only multilingual input but also longer input context, with superior image generation quality.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses three limitations of CLIP-based text encoding in text-to-image generation:

1. **Multilingual Support**: Existing text-to-image generation models rely primarily on the CLIP model for text encoding, but CLIP supports only English input, which leaves non-English prompts unsupported.
2. **Text Representation Capability**: The CLIP text encoder has relatively limited capacity compared to LLMs, which limits its text representation capability and in turn the quality of the generated images.
3. **Long Text Processing**: CLIP has a maximum token length of 77, so longer text descriptions cannot be fully encoded and contextual information is lost.

To address these issues, the paper proposes a three-stage training pipeline called OmniDiffusion. The method strengthens text representation by using large language models (LLMs) as the text encoder, and introduces a lightweight adapter module that connects the LLM's text features to an existing diffusion model. The three stages are:

1. **Multilingual Text Alignment**: Train the adapter module to align the text features of the LLM with the joint embedding space of the CLIP model, so that multilingual text representations are aligned with the space the diffusion model already uses.
2. **End-to-End Text-to-Image Training**: Building on this alignment, further optimize the adapter module together with the UNet of the diffusion model to improve the quality of the generated images.
3. **High Aesthetic Fine-Tuning**: Finally, fine-tune the model on a high-quality image dataset to enhance the visual aesthetics of the generated images.

Experimental results show that OmniDiffusion supports multiple input languages, generates high-quality images, and performs well on multiple benchmarks.
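The three-stage schedule summarized above can be sketched as a small table of which modules are updated in each stage. This is only an illustrative sketch, not the authors' code: the module labels (`llm`, `clip_text_encoder`, `adapter`, `unet`), the `Stage` class, and the helper functions are hypothetical names introduced here. The summary states that stage 1 trains the adapter and stage 2 optimizes the adapter and the UNet; that the LLM and CLIP encoder stay frozen throughout, that stage 3 updates the same modules as stage 2, and the data labels for stages 1 and 2 are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # modules whose weights are updated in this stage
    data: str             # descriptive label for the training data


# Hypothetical module labels for the components named in the summary.
MODULES = frozenset({"llm", "clip_text_encoder", "adapter", "unet"})

PIPELINE = (
    # Stage 1: only the adapter is trained, aligning LLM text features
    # with CLIP's embedding space (LLM and CLIP assumed frozen).
    Stage("multilingual_text_alignment",
          frozenset({"adapter"}),
          "multilingual text (assumed)"),
    # Stage 2: the adapter and the diffusion UNet are optimized together.
    Stage("end_to_end_text_to_image",
          frozenset({"adapter", "unet"}),
          "text-image pairs (assumed)"),
    # Stage 3: the summary only says "fine-tune the model";
    # updating adapter + UNet here is an assumption.
    Stage("high_aesthetic_fine_tuning",
          frozenset({"adapter", "unet"}),
          "high-quality image dataset"),
)


def frozen_modules(stage):
    """Return the modules that are NOT updated in the given stage."""
    return MODULES - stage.trainable


def describe(pipeline):
    """Render one human-readable line per stage."""
    lines = []
    for i, stage in enumerate(pipeline, start=1):
        train = ", ".join(sorted(stage.trainable))
        frozen = ", ".join(sorted(frozen_modules(stage)))
        lines.append(
            f"Stage {i} ({stage.name}): train [{train}]; "
            f"freeze [{frozen}]; data: {stage.data}"
        )
    return lines


if __name__ == "__main__":
    for line in describe(PIPELINE):
        print(line)
```

In a real training setup, each stage's `trainable` set would typically decide which parameters receive gradient updates, while everything in `frozen_modules(stage)` is held fixed. Keeping the LLM frozen and training only a small adapter first is what would make the early alignment stage cheap compared with training a text-to-image model with an LLM from scratch.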