Abstract: We present Lumina-mGPT, a family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions. Unlike existing autoregressive image generation approaches, Lumina-mGPT employs a pretrained decoder-only transformer as a unified framework for modeling multimodal token sequences. Our key insight is that a simple decoder-only transformer with multimodal Generative PreTraining (mGPT), utilizing the next-token prediction objective on massive interleaved text-image sequences, can learn broad and general multimodal capabilities, thereby illuminating photorealistic text-to-image generation. Building on these pretrained models, we propose Flexible Progressive Supervised Finetuning (FP-SFT) on high-quality image-text pairs to fully unlock their potential for high-aesthetic image synthesis at any resolution while maintaining their general multimodal capabilities. Furthermore, we introduce Omnipotent Supervised Finetuning (Omni-SFT), transforming Lumina-mGPT into a foundation model that seamlessly achieves omnipotent task unification. The resulting model demonstrates versatile multimodal capabilities, including visual generation tasks like flexible text-to-image generation and controllable generation, visual recognition tasks like segmentation and depth estimation, and vision-language tasks like multiturn visual question answering. Additionally, we analyze the differences and similarities between diffusion-based and autoregressive methods in a direct comparison.

What problem does this paper attempt to address?

The paper aims to address the challenges faced by current autoregressive (AR) generative models in generating high-quality, flexible-resolution realistic images and attempts to bridge the performance gap between AR methods and diffusion models through the proposed Lumina-mGPT model. Specifically, the study addresses the following key issues:

1. **Effective Initialization Problem**: Existing AR image generation methods typically start training from randomly initialized transformers, which cannot fully leverage the advantages of large-scale pre-trained representations. The paper proposes using multimodal generative pre-training (mGPT) as the initial representation to improve image generation quality and accelerate the convergence speed of downstream tasks.
2. **Encoder-Decoder Architecture Limitation**: Some methods adopt complex encoder-decoder architectures, which not only increase the model's complexity but also limit the scalability of image generation and support for other modalities. Lumina-mGPT employs a simple decoder-only architecture, capable of handling text encoding and image token decoding within a single framework.
3. **Generation Resolution and Flexibility Limitation**: Current AR methods mainly rely on low-resolution images with center cropping for training, which simplifies the training process but reduces image quality and generation flexibility. Lumina-mGPT achieves flexible high-resolution image generation through a progressive supervised fine-tuning strategy (FP-SFT).
4. **Insufficient Task Expansion Capability**: Previous AR methods have primarily focused on text-to-image generation and have not explored the unification with other tasks (such as dense annotation and controllable image generation). Lumina-mGPT achieves unified modeling of multiple tasks through omnipotent supervised fine-tuning (Omni-SFT).

In summary, the core objective of this study is to achieve high-quality, flexible-resolution image generation through the Lumina-mGPT model and to expand its multimodal task processing capabilities, thereby constructing a general model that can support various visual and language tasks.
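
To make the decoder-only, next-token formulation described above more concrete, the following is a minimal Python sketch of how text tokens and image tokens might share one sequence and one loss. It is not the paper's implementation: the vocabulary sizes, the `BOI`/`EOI` marker tokens, and the helper names `interleave` and `next_token_loss` are all assumptions chosen for illustration, and uniform logits stand in for a real transformer.

```python
import math

# All sizes and special tokens below are hypothetical, for illustration only.
TEXT_VOCAB = 1000                    # assumed text vocabulary size
IMAGE_VOCAB = 8192                   # assumed image codebook size
BOI = TEXT_VOCAB + IMAGE_VOCAB       # assumed begin-of-image marker id
EOI = BOI + 1                        # assumed end-of-image marker id
VOCAB_SIZE = EOI + 1                 # one shared id space for both modalities


def interleave(text_ids, image_codes):
    """Place text ids and image codebook indices in one token sequence."""
    # Image codes are shifted past the text range so the ids never collide.
    image_ids = [TEXT_VOCAB + code for code in image_codes]
    return list(text_ids) + [BOI] + image_ids + [EOI]


def next_token_loss(logits_per_step, sequence):
    """Mean cross-entropy of predicting sequence[t + 1] from step t's logits."""
    total = 0.0
    for t in range(len(sequence) - 1):
        logits = logits_per_step[t]
        peak = max(logits)
        # Numerically stable log-sum-exp over the shared vocabulary.
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[sequence[t + 1]]
    return total / (len(sequence) - 1)


# A toy prompt of three text tokens followed by a three-token "image".
seq = interleave([5, 17, 42], [3, 8191, 0])

# Uniform logits stand in for the output of a real decoder-only transformer.
uniform_logits = [[0.0] * VOCAB_SIZE for _ in seq]
loss = next_token_loss(uniform_logits, seq)
```

Because every position is scored with the same objective regardless of modality, text encoding and image token decoding need no separate encoder or decoder in this sketch. With uniform logits the loss equals the log of the shared vocabulary size, which is the baseline an untrained model would start from.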

Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining

Emage: Non-Autoregressive Text-to-Image Generation

Multimodal Latent Language Modeling with Next-Token Diffusion

Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT

Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers

DiffusionGPT: LLM-Driven Text-to-Image Generation System

Generating Images with Multimodal Language Models

VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation

Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

Controllable Text-to-Image Generation with GPT-4

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion

Group Diffusion Transformers are Unsupervised Multitask Learners

Emu: Generative Pretraining in Multimodality

VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning

VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding

TOMGPT: Reliable Text-Only Training Approach for Cost-Effective Multi-modal Large Language Model

MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens

MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation

Multi-Modal Generative AI: Multi-modal LLM, Diffusion and Beyond