Abstract: We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and generation. Unlike fully autoregressive models, Show-o unifies autoregressive and (discrete) diffusion modeling to adaptively handle inputs and outputs of various and mixed modalities. The unified model flexibly supports a wide range of vision-language tasks including visual question-answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation. Across various benchmarks, it demonstrates comparable or superior performance to existing individual models with an equivalent or larger number of parameters tailored for understanding or generation. This significantly highlights its potential as a next-generation foundation model. Code and models are released at https://github.com/showlab/Show-o.

What problem does this paper attempt to address?

The problem this paper addresses is how to build a single model that handles both multimodal understanding and multimodal generation. Specifically, the authors propose a unified Transformer named Show-o, which integrates autoregressive and discrete diffusion modeling to handle inputs and outputs of different and mixed modalities. The goal is to flexibly support a wide range of vision-language tasks, such as visual question answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation.

### Main Problem Decomposition:

1. **Unification of Multimodal Understanding and Generation**:
   - Existing models usually treat multimodal understanding and generation as two separate tasks and use a different model for each. This separation limits the flexibility and generality of the model.
   - The authors aim to build a single Transformer that handles both multimodal understanding (such as visual question answering) and multimodal generation (such as text-to-image generation).
2. **Combination of Autoregressive and Diffusion Modeling**:
   - Autoregressive models perform well on sequence data, but they require a large number of sampling steps when generating images, which makes them less efficient.
   - Diffusion models perform excellently in image generation, especially high-resolution image generation. The authors aim to improve the efficiency and performance of the model by combining the two methods.
3. **Support for Cross-Modal Tasks**:
   - Beyond the basic multimodal understanding and generation tasks, the authors also want the model to naturally support other downstream applications, such as text-guided inpainting and extrapolation, without additional fine-tuning.
   - In addition, the model should be able to handle mixed-modality generation, for example generating video key frames from text descriptions.

### Solution Overview:

- **Show-o Model**: Built on a pre-trained large language model (LLM), Show-o extends the embedding layer to include discrete image tokens and processes different types of input data through a unified prompting strategy.
- **Omni-Attention Mechanism**: An integrated attention mechanism that combines causal attention and full attention, so that the model can adapt to input sequences in different formats.
- **Training Objectives**: Two learning objectives, next-token prediction (NTP) and mask-token prediction (MTP), are adopted to integrate autoregressive and discrete diffusion modeling seamlessly.

Through these innovations, Show-o not only demonstrates performance comparable to or better than existing models on multiple benchmarks, but also significantly reduces the sampling steps required for image generation, improving efficiency. It also naturally supports multiple downstream applications, showing its potential as a next-generation foundation model.
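As an illustration of the omni-attention idea described above, the following is a minimal sketch and not the authors' implementation. It assumes that text tokens use causal attention, while image tokens attend to everything before them and to every other token in the same contiguous image block. The function name `omni_attention_mask` and the `"text"`/`"image"` labels are hypothetical and chosen only for this example.

```python
from typing import List


def omni_attention_mask(modalities: List[str]) -> List[List[bool]]:
    """Build a boolean attention mask for a mixed text/image token sequence.

    mask[i][j] is True when query token i may attend to key token j.
    Assumption (illustrative only): text tokens are causal; image tokens
    are causal plus fully connected within their own contiguous image block.
    """
    # Give every run of consecutive image tokens its own block id;
    # text tokens get -1, meaning "not part of any image block".
    block_ids: List[int] = []
    current_block = -1
    previous = None
    for modality in modalities:
        if modality == "image":
            if previous != "image":
                current_block += 1
            block_ids.append(current_block)
        else:
            block_ids.append(-1)
        previous = modality

    n = len(modalities)
    mask: List[List[bool]] = []
    for i in range(n):
        row: List[bool] = []
        for j in range(n):
            causal = j <= i  # standard autoregressive visibility
            same_image = block_ids[i] != -1 and block_ids[i] == block_ids[j]
            row.append(causal or same_image)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Two text tokens, a three-token image, then one more text token.
    example = ["text", "text", "image", "image", "image", "text"]
    for row in omni_attention_mask(example):
        print("".join("1" if allowed else "0" for allowed in row))
    # prints:
    # 100000
    # 110000
    # 111110
    # 111110
    # 111110
    # 111111
```

Under these assumptions, a text-only sequence reduces to an ordinary lower-triangular causal mask, which is one way a single mask rule can serve input sequences in different formats.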

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
