Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining

Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yu Qiao, Hongsheng Li, Peng Gao
2024-08-06
Abstract: We present Lumina-mGPT, a family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions. Unlike existing autoregressive image generation approaches, Lumina-mGPT employs a pretrained decoder-only transformer as a unified framework for modeling multimodal token sequences. Our key insight is that a simple decoder-only transformer with multimodal Generative PreTraining (mGPT), utilizing the next-token prediction objective on massive interleaved text-image sequences, can learn broad and general multimodal capabilities, thereby illuminating photorealistic text-to-image generation. Building on these pretrained models, we propose Flexible Progressive Supervised Finetuning (FP-SFT) on high-quality image-text pairs to fully unlock their potential for high-aesthetic image synthesis at any resolution while maintaining their general multimodal capabilities. Furthermore, we introduce Omnipotent Supervised Finetuning (Omni-SFT), transforming Lumina-mGPT into a foundation model that seamlessly achieves omnipotent task unification. The resulting model demonstrates versatile multimodal capabilities, including visual generation tasks like flexible text-to-image generation and controllable generation, visual recognition tasks like segmentation and depth estimation, and vision-language tasks like multiturn visual question answering. Additionally, we analyze the differences and similarities between diffusion-based and autoregressive methods in a direct comparison.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the difficulty current autoregressive (AR) generative models have in producing high-quality photorealistic images at flexible resolutions, and it attempts to close the performance gap between AR methods and diffusion models with the proposed Lumina-mGPT model. Specifically, the study targets four issues:

1. **Effective Initialization Problem**: Existing AR image generation methods typically train from randomly initialized transformers and therefore cannot leverage large-scale pretrained representations. The paper proposes multimodal generative pretraining (mGPT) as the initialization, to improve image generation quality and speed up convergence on downstream tasks.
2. **Encoder-Decoder Architecture Limitation**: Some methods adopt complex encoder-decoder architectures, which increase model complexity and limit both the scalability of image generation and support for other modalities. Lumina-mGPT employs a simple decoder-only architecture that handles text encoding and image token decoding within a single framework.
3. **Generation Resolution and Flexibility Limitation**: Current AR methods mainly train on low-resolution, center-cropped images, which simplifies training but reduces image quality and generation flexibility. Lumina-mGPT achieves flexible high-resolution image generation through Flexible Progressive Supervised Finetuning (FP-SFT).
4. **Insufficient Task Expansion Capability**: Previous AR methods have focused primarily on text-to-image generation and have not explored unification with other tasks, such as dense annotation and controllable image generation. Lumina-mGPT models multiple tasks in a unified way through Omnipotent Supervised Finetuning (Omni-SFT).
In summary, the core objective of this study is to achieve high-quality, flexible-resolution image generation through the Lumina-mGPT model and to expand its multimodal task processing capabilities, thereby constructing a general model that can support various visual and language tasks.
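To make the decoder-only idea from point 2 and the next-token prediction objective from the abstract more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the vocabulary sizes, the `BOI`/`EOI` marker ids, and the helper names `build_sequence` and `next_token_loss` are all hypothetical, chosen only for illustration. It shows text tokens and image codebook indices sharing one token stream, with a single cross-entropy loss computed over that stream.

```python
import math

# Hypothetical vocabulary layout (illustrative only, not from the paper):
# ids 0..7 are text tokens, ids 8..11 are image codebook entries,
# and ids 12 and 13 mark the beginning and end of an image.
TEXT_VOCAB = 8
IMAGE_VOCAB = 4
BOI, EOI = 12, 13
VOCAB_SIZE = 14


def build_sequence(text_ids, image_codes):
    """Concatenate text tokens and offset image codes into one token stream."""
    for code in image_codes:
        assert 0 <= code < IMAGE_VOCAB, "image code outside the codebook"
    shifted = [TEXT_VOCAB + code for code in image_codes]
    return list(text_ids) + [BOI] + shifted + [EOI]


def next_token_loss(logits, tokens):
    """Mean cross-entropy of predicting tokens[t + 1] from logits[t]."""
    total = 0.0
    for t in range(len(tokens) - 1):
        row = logits[t]
        # Numerically stable log-sum-exp over the vocabulary.
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[tokens[t + 1]]
    return total / (len(tokens) - 1)


# A caption of three text tokens followed by a four-token "image".
seq = build_sequence([1, 5, 2], [3, 0, 1, 2])

# Stand-in for a model: uniform logits at every position.
uniform_logits = [[0.0] * VOCAB_SIZE for _ in seq]
loss = next_token_loss(uniform_logits, seq)
```

With uniform logits the loss equals the log of the vocabulary size, which is a convenient sanity check. In a real system the logits would come from the transformer, and the image codes from a learned image tokenizer.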