Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, Mike Zheng Shou
2024-10-14
Abstract: We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and generation. Unlike fully autoregressive models, Show-o unifies autoregressive and (discrete) diffusion modeling to adaptively handle inputs and outputs of various and mixed modalities. The unified model flexibly supports a wide range of vision-language tasks including visual question-answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation. Across various benchmarks, it demonstrates comparable or superior performance to existing individual models with an equivalent or larger number of parameters tailored for understanding or generation. This significantly highlights its potential as a next-generation foundation model. Code and models are released at https://github.com/showlab/Show-o.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to build a single model that handles both multimodal understanding and multimodal generation. Specifically, the authors propose a unified Transformer named Show-o that integrates autoregressive and discrete diffusion modeling so that it can adapt to inputs and outputs of different and mixed modalities. The model is meant to flexibly support a wide range of vision-language tasks, such as visual question answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation.

### Main Problem Decomposition:

1. **Unification of Multimodal Understanding and Generation**:
   - Existing models usually treat multimodal understanding and generation as two separate tasks and handle them with different models. This separation limits flexibility and generality.
   - The authors aim to build a single Transformer that can handle both multimodal understanding (such as visual question answering) and multimodal generation (such as text-to-image generation).
2. **Combination of Autoregressive and Diffusion Modeling**:
   - Autoregressive models perform well on sequence data, but they require a large number of sampling steps when generating images, which is inefficient.
   - Diffusion models perform excellently in image generation, especially at high resolution. The authors aim to improve the efficiency and performance of the model by combining the two approaches.
3. **Support for Cross-Modal Tasks**:
   - Beyond the basic multimodal understanding and generation tasks, the authors also want the model to naturally support other downstream applications, such as text-guided inpainting and extrapolation, without additional fine-tuning.
   - In addition, the model should be able to handle mixed-modality generation, for example, generating video key frames from text descriptions.

### Solution Overview:

- **Show-o Model**: Built on a pre-trained large language model (LLM), Show-o extends the embedding layer to include discrete image tokens and processes different types of input data through a unified prompting strategy.
- **Omni-Attention Mechanism**: An integrated attention mechanism that combines causal attention and full attention is introduced to adapt to input sequences in different formats.
- **Training Objectives**: Two learning objectives, next-token prediction (NTP) and mask-token prediction (MTP), are adopted to achieve seamless integration of autoregressive and discrete diffusion modeling.

Through these innovations, Show-o not only demonstrates performance comparable to or better than existing models on multiple benchmarks, but also significantly reduces the number of sampling steps required for image generation, improving efficiency. It also naturally supports multiple downstream applications, showing its potential as a next-generation foundation model.
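The omni-attention mechanism and the two training objectives named in the solution overview can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes that text tokens attend causally, that image tokens within one contiguous image block attend to each other in both directions, and that the two losses are combined as a weighted sum. The function names and the weight `alpha` are hypothetical and chosen only for illustration.

```python
import numpy as np


def omni_attention_mask(is_image):
    """Return an (n, n) boolean mask; mask[i, j] is True if token i may attend to token j.

    is_image: sequence of bools, True where the token is a discrete image token.
    Assumption: each contiguous run of image tokens forms one image block.
    """
    is_image = np.asarray(is_image, dtype=bool)
    n = is_image.shape[0]
    # Causal part: every token sees itself and all earlier tokens.
    causal = np.tril(np.ones((n, n), dtype=bool))
    # Give each contiguous run of image tokens its own block id (0 means text).
    starts = is_image & ~np.concatenate(([False], is_image[:-1]))
    block_id = np.cumsum(starts) * is_image
    # Full part: image tokens of the same block see each other in both directions.
    same_block = (block_id[:, None] == block_id[None, :]) & (block_id[:, None] > 0)
    return causal | same_block


def cross_entropy(logits, targets):
    """Mean cross-entropy over positions; logits is (n, vocab), targets is (n,)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()


def combined_loss(text_logits, text_targets, image_logits, image_targets,
                  masked, alpha=1.0):
    """Sum of mask-token prediction and weighted next-token prediction losses.

    text_logits / text_targets: NTP inputs, already shifted by the caller.
    masked: boolean array marking which image positions were replaced by a
            mask token; only those positions contribute to the MTP loss.
    alpha: hypothetical weighting factor (an assumption of this sketch).
    """
    ntp = cross_entropy(text_logits, text_targets)
    mtp = cross_entropy(image_logits[masked], image_targets[masked]) if masked.any() else 0.0
    return mtp + alpha * ntp


# Example: two text tokens, a three-token image block, one trailing text token.
mask = omni_attention_mask([False, False, True, True, True, False])
print(mask.astype(int))
```

In this sketch `True` means "may attend". Attention libraries differ on whether a boolean mask marks allowed or blocked positions, so the mask may need to be inverted before being passed to a particular attention implementation.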