Abstract: While next-token prediction is considered a promising path towards artificial general intelligence, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs). In this paper, we introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. Emu3 outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship models such as SDXL and LLaVA-1.6, while eliminating the need for diffusion or compositional architectures. Emu3 is also capable of generating high-fidelity video via predicting the next token in a video sequence. We simplify complex multimodal model designs by converging on a singular focus: tokens, unlocking great potential for scaling both during training and inference. Our results demonstrate that next-token prediction is a promising path towards building general multimodal intelligence beyond language. We open-source key techniques and models to support further research in this direction.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper examines whether next-token prediction alone is sufficient for multimodal tasks. Although next-token prediction has driven significant progress in language models, its application to multimodal generation and perception remains limited. These tasks are currently dominated by diffusion models (such as Stable Diffusion) and compositional approaches (such as CLIP combined with large language models).

**Specifically, the paper attempts to solve the following problems:**

1. **Generation and Perception Capabilities in Multimodal Tasks**:
   - How to use a single Transformer to achieve high-quality image, text, and video generation through next-token prediction.
   - How to surpass existing task-specific models in multimodal perception tasks (such as vision-language understanding).
2. **Simplifying Model Design**:
   - How to simplify complex multimodal model designs by converting data from every modality (images, text, videos) into discrete tokens.
   - How to eliminate the need for diffusion models or compositional architectures, which the authors argue makes scaling easier during both training and inference.
3. **Generalization Capability**:
   - How to build a general multimodal model that handles data from multiple modalities rather than language alone.

### Main Contributions of the Paper

- **Proposing Emu3**: A suite of multimodal generation and perception models built on a single Transformer and trained entirely with next-token prediction.
- **Performance Surpassing Existing Models**: Emu3 outperforms well-established task-specific models such as SDXL and LLaVA-1.6 on generation and perception benchmarks, and it can also generate high-fidelity video.
- **Simplifying Model Design**: Converting data from every modality into discrete tokens replaces complex multimodal designs with a single token-based pipeline that is intended to scale during both training and inference.
- **Open Source**: Key techniques and models are open-sourced to support further research.

### Conclusion

Through Emu3, the paper demonstrates the potential of next-token prediction in multimodal tasks: it achieves strong performance in both generation and perception, simplifies model design, and offers a path towards building general multimodal intelligence.
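The idea of converting every modality into discrete tokens and training with one next-token objective can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not Emu3's actual tokenizer or configuration: the vocabulary sizes, the `BOI`/`EOI` marker tokens, and the function names are hypothetical, and the logits are placeholders for the output of a Transformer.

```python
import math

# Hypothetical vocabulary layout (illustrative assumption, not from the paper).
TEXT_VOCAB = 1000                     # ids 0..999 are text tokens
VISION_VOCAB = 512                    # size of the visual codebook
BOI = TEXT_VOCAB + VISION_VOCAB       # begin-of-image marker
EOI = BOI + 1                         # end-of-image marker
VOCAB_SIZE = EOI + 1                  # one shared vocabulary for all modalities


def vision_to_shared(codes):
    """Shift visual codebook indices past the text ids so both share one id space."""
    return [TEXT_VOCAB + c for c in codes]


def build_sequence(text_ids, vision_codes):
    """Concatenate text tokens and image tokens into one flat sequence."""
    return text_ids + [BOI] + vision_to_shared(vision_codes) + [EOI]


def next_token_loss(logits, sequence):
    """Mean cross-entropy of predicting sequence[t + 1] from logits[t]."""
    total = 0.0
    for t in range(len(sequence) - 1):
        row = logits[t]
        m = max(row)
        # Numerically stable log-sum-exp over the shared vocabulary.
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[sequence[t + 1]]
    return total / (len(sequence) - 1)


# A caption of three text tokens followed by four visual codes.
seq = build_sequence([5, 17, 42], [3, 500, 7, 128])

# Placeholder logits: an untrained model that is uniform over the vocabulary.
uniform_logits = [[0.0] * VOCAB_SIZE for _ in seq]
loss = next_token_loss(uniform_logits, seq)
```

With uniform logits the loss equals `log(VOCAB_SIZE)`, the expected starting point before any training. The same loss function applies whether the target at a given position is a text token or an image token, which is the sense in which a single objective covers both generation and perception.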

Emu3: Next-Token Prediction is All You Need

Multimodal Latent Language Modeling with Next-Token Diffusion

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

Emu: Generative Pretraining in Multimodality

Going Beyond Multi-Task Dense Prediction with Synergy Embedding Models

TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction

Generative Multimodal Models are In-Context Learners

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

PredToken: Predicting Unknown Tokens and Beyond with Coarse-to-Fine Iterative Decoding

Mechanics of Next Token Prediction with Self-Attention

EMU: Effective Multi-Hot Encoding Net for Lightweight Scene Text Recognition with a Large Character Set

Token-disentangling Mutual Transformer for multimodal emotion recognition

Multimodal Token Fusion for Vision Transformers

MoST: Multi-modality Scene Tokenization for Motion Prediction

Everything is a Video: Unifying Modalities through Next-Frame Prediction

Better & Faster Large Language Models via Multi-token Prediction

End-to-end training of Multimodal Model and ranking Model

Next-token prediction capacity: general upper bounds and a lower bound for transformers

Egocentric Early Action Prediction Via Multimodal Transformer-Based Dual Action Prediction

Chameleon: Mixed-Modal Early-Fusion Foundation Models