VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation

Jinguo Zhu, Xiaohan Ding, Yixiao Ge, Yuying Ge, Sijie Zhao, Hengshuang Zhao, Xiaohua Wang, Ying Shan
2023-12-15
Abstract: In this work, we introduce Vision-Language Generative Pre-trained Transformer (VL-GPT), a transformer model proficient at concurrently perceiving and generating visual and linguistic data. VL-GPT achieves a unified pre-training approach for both image and text modalities by employing a straightforward auto-regressive objective, thereby enabling the model to process image and text as seamlessly as a language model processes text. To accomplish this, we initially propose a novel image tokenizer-detokenizer framework for visual data, specifically designed to transform raw images into a sequence of continuous embeddings and reconstruct them accordingly. In combination with the existing text tokenizer and detokenizer, this framework allows for the encoding of interleaved image-text data into a multimodal sequence, which can subsequently be fed into the transformer model. Consequently, VL-GPT can perform large-scale pre-training on multimodal corpora utilizing a unified auto-regressive objective (i.e., next-token prediction). Upon completion of pre-training, VL-GPT exhibits remarkable zero-shot and few-shot performance across a diverse range of vision and language understanding and generation tasks, including image captioning, visual question answering, text-to-image generation, and more. Additionally, the pre-trained model retains in-context learning capabilities when provided with multimodal prompts. We further conduct instruction tuning on our VL-GPT, highlighting its exceptional potential for multimodal assistance. The source code and model weights shall be released.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper introduces **VL-GPT** (Vision-Language Generative Pre-trained Transformer), a transformer model that can both perceive and generate visual and language data. The main objectives are as follows:

1. **Unified Pre-training Method**:
   - Proposes a new image tokenizer-detokenizer framework that converts raw images into a sequence of continuous embeddings and reconstructs images from those embeddings.
   - This lets the model handle interleaved image and text data the way a language model handles text, using a simple autoregressive objective (next-token prediction) for unified pre-training.
2. **Performance Improvement in Multimodal Tasks**:
   - Demonstrates strong zero-shot and few-shot performance on a range of vision and language understanding and generation tasks, including image captioning, visual question answering, and text-to-image generation.
   - Retains in-context learning ability, handling new tasks when given multimodal prompts.
3. **Efficient Training Framework**:
   - Uses pre-trained image encoders and high-quality image diffusion models to initialize the framework, improving training efficiency.
   - Achieves bidirectional conversion between images and embeddings by minimizing a loss function that simultaneously optimizes image-conditioned embeddings and text-conditioned embeddings.
4. **Multimodal Foundation Model**:
   - VL-GPT is expected to become a strong foundation model for the multimodal research community, similar to the role of the GPT series in natural language processing.

### Main Contributions

1. **Proposed Image Tokenizer-Detokenizer Framework**: Achieves bidirectional conversion between images and continuous embeddings and explores effective training methods.
2. **Introduced VL-GPT Model**: Capable of pre-training on large-scale multimodal corpora in a unified autoregressive manner.
3. **Extensive Task Performance**: Excels in various vision and language understanding and generation tasks and possesses multimodal in-context learning capabilities.
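
The unified autoregressive objective over an interleaved image-text sequence can be illustrated with a small sketch. This is not the authors' implementation. It assumes, for illustration only, that text positions are scored with cross-entropy over a vocabulary and that continuous image-embedding positions are scored with mean squared error; the summary above states only that a single next-token objective is applied to a multimodal sequence. The names `TextToken`, `ImageEmbedding`, and `autoregressive_loss` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple, Union


@dataclass
class TextToken:
    token_id: int  # index into a (hypothetical) text vocabulary


@dataclass
class ImageEmbedding:
    vector: List[float]  # one continuous embedding from an image tokenizer


Element = Union[TextToken, ImageEmbedding]


def cross_entropy(logits: Sequence[float], target_id: int) -> float:
    """Negative log-likelihood of target_id under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target_id]


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two vectors of equal length."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def autoregressive_loss(
    sequence: Sequence[Element],
    outputs: Sequence[Tuple[Sequence[float], Sequence[float]]],
) -> float:
    """Average next-element loss over an interleaved sequence.

    outputs[i] is the (text_logits, predicted_embedding) pair the model
    emits after reading sequence[: i + 1]; it is scored against
    sequence[i + 1]. Assumption: cross-entropy for text targets and
    MSE for continuous image targets.
    """
    losses = []
    for i in range(len(sequence) - 1):
        logits, pred_vec = outputs[i]
        target = sequence[i + 1]
        if isinstance(target, TextToken):
            losses.append(cross_entropy(logits, target.token_id))
        else:
            losses.append(mse(pred_vec, target.vector))
    return sum(losses) / len(losses)


# Toy interleaved sequence: text, one image embedding, text.
sequence = [TextToken(0), ImageEmbedding([1.0, 0.0]), TextToken(1)]
# Stand-in model outputs for the first two positions (made-up values).
outputs = [
    ([0.0, 0.0], [1.0, 0.0]),  # predicts the image embedding exactly
    ([0.0, 0.0], [0.0, 0.0]),  # uniform logits over a 2-token vocabulary
]
loss = autoregressive_loss(sequence, outputs)
print(loss)
```

The point of the sketch is that one loop covers both modalities: every position is trained to predict whatever comes next, and only the per-element scoring rule differs by target type.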