Abstract: In this work, we introduce Vision-Language Generative Pre-trained Transformer (VL-GPT), a transformer model proficient at concurrently perceiving and generating visual and linguistic data. VL-GPT achieves a unified pre-training approach for both image and text modalities by employing a straightforward auto-regressive objective, thereby enabling the model to process image and text as seamlessly as a language model processes text. To accomplish this, we initially propose a novel image tokenizer-detokenizer framework for visual data, specifically designed to transform raw images into a sequence of continuous embeddings and reconstruct them accordingly. In combination with the existing text tokenizer and detokenizer, this framework allows for the encoding of interleaved image-text data into a multimodal sequence, which can subsequently be fed into the transformer model. Consequently, VL-GPT can perform large-scale pre-training on multimodal corpora utilizing a unified auto-regressive objective (i.e., next-token prediction). Upon completion of pre-training, VL-GPT exhibits remarkable zero-shot and few-shot performance across a diverse range of vision and language understanding and generation tasks, including image captioning, visual question answering, text-to-image generation, and more. Additionally, the pre-trained model retains in-context learning capabilities when provided with multimodal prompts. We further conduct instruction tuning on our VL-GPT, highlighting its exceptional potential for multimodal assistance. The source code and model weights shall be released.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper introduces **VL-GPT** (Vision-Language Generative Pre-trained Transformer), a transformer model that can both perceive and generate visual and language data. The main objectives are as follows:

1. **Unified Pre-training Method**:
   - Proposes a new image tokenizer-detokenizer framework that converts raw images into a sequence of continuous visual embeddings and reconstructs images from those embeddings.
   - This lets the model handle interleaved image and text data the way a language model handles text, using a simple autoregressive objective (next-token prediction) for unified pre-training.
2. **Performance Improvement in Multimodal Tasks**:
   - Demonstrates strong zero-shot and few-shot performance on a range of vision and language understanding and generation tasks, including image captioning, visual question answering, and text-to-image generation.
   - Retains in-context learning capabilities, handling new tasks when given multimodal prompts.
3. **Efficient Training Framework**:
   - Utilizes pre-trained image encoders and high-quality image diffusion models to initialize the framework, improving training efficiency.
   - Achieves efficient bidirectional conversion by minimizing a loss function that simultaneously optimizes image-conditioned embeddings and text-conditioned embeddings.
4. **Multimodal Foundation Model**:
   - VL-GPT is expected to become a powerful foundation model for the multimodal research community, similar to the role of the GPT series in natural language processing.

### Main Contributions

1. **Proposed Image Tokenizer-Detokenizer Framework**: Achieves bidirectional conversion between images and continuous embeddings and explores effective training methods.
2. **Introduced VL-GPT Model**: Capable of pre-training on large-scale multimodal corpora in a unified autoregressive manner.
3. **Extensive Task Performance**: Excels in various vision and language understanding and generation tasks and possesses multimodal in-context learning capabilities.
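
To make the unified autoregressive objective concrete, the following is a minimal sketch of how an interleaved image-text sequence might be scored position by position. It is not the authors' implementation. The names `TextToken`, `ImageEmbedding`, `interleave`, and `next_token_loss` are hypothetical, and the choice of cross-entropy for text targets and mean squared error for continuous image-embedding targets is an assumption; the summary above only states that a single next-token prediction objective is used. The model outputs (`text_logits`, `image_preds`) are passed in as plain lists standing in for a transformer's predictions.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Union


@dataclass
class TextToken:
    token_id: int  # index into a hypothetical text vocabulary


@dataclass
class ImageEmbedding:
    vector: List[float]  # one continuous embedding from a hypothetical image tokenizer


Element = Union[TextToken, ImageEmbedding]


def interleave(segments: Sequence[Sequence[Element]]) -> List[Element]:
    """Flatten alternating text and image segments into one multimodal sequence."""
    sequence: List[Element] = []
    for segment in segments:
        sequence.extend(segment)
    return sequence


def cross_entropy(logits: Sequence[float], target_id: int) -> float:
    """Negative log-likelihood of target_id under softmax(logits), computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_id]


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between a predicted and a target embedding."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def next_token_loss(
    sequence: Sequence[Element],
    text_logits: Sequence[Sequence[float]],
    image_preds: Sequence[Sequence[float]],
) -> float:
    """Average loss for predicting element t+1 from the output at position t.

    Assumption: text targets use cross-entropy, image targets use MSE.
    """
    losses = []
    for t in range(len(sequence) - 1):
        target = sequence[t + 1]
        if isinstance(target, TextToken):
            losses.append(cross_entropy(text_logits[t], target.token_id))
        else:
            losses.append(mse(image_preds[t], target.vector))
    return sum(losses) / len(losses)


# Usage: two text tokens, one image embedding, then one more text token.
sequence = interleave(
    [[TextToken(0), TextToken(1)], [ImageEmbedding([1.0, 0.0])], [TextToken(2)]]
)
text_logits = [[0.0, 0.0, 0.0]] * len(sequence)  # uniform stand-in predictions
image_preds = [[1.0, 0.0]] * len(sequence)       # stand-in embedding predictions
loss = next_token_loss(sequence, text_logits, image_preds)
```

The point of the sketch is that one loop over positions covers both modalities: only the per-target loss term changes, which is what allows a single sequence model to be trained on interleaved data.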

VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation

VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework

MAGVLT: Masked Generative Vision-and-Language Transformer

A Survey of Vision-Language Pre-Trained Models

Vision-and-Language Navigation Generative Pretrained Transformer

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens

GiT: Towards Generalist Vision Transformer through Universal Language Interface

Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation

GIT: A Generative Image-to-text Transformer for Vision and Language

Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization

VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning

Unified Vision-Language Pre-Training for Image Captioning and VQA

GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?

VL-BERT: Pre-training of Generic Visual-Linguistic Representations

GIVT: Generative Infinite-Vocabulary Transformers

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

RegionGPT: Towards Region Understanding Vision Language Model

BatGPT: A Bidirectional Autoregessive Talker from Generative Pre-trained Transformer

Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models