Abstract: The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses the cost of Vision-Language Pre-training (VLP). As models grow, end-to-end training of large models demands ever more computation, making VLP research increasingly unaffordable. In addition, most existing VLP methods cannot flexibly reuse existing unimodal pre-trained models, such as large language models (LLMs), during pre-training.

To solve these problems, the paper proposes BLIP-2, an efficient and generic pre-training strategy that bootstraps vision-language pre-training from a frozen image encoder and a frozen large language model. Specifically, BLIP-2 bridges the gap between modalities with a lightweight Querying Transformer (Q-Former), which is pre-trained in two stages:

1. **Stage 1**: Bootstrap vision-language representation learning from the frozen image encoder.
2. **Stage 2**: Bootstrap vision-to-language generative learning from the frozen language model.

With this approach, BLIP-2 achieves state-of-the-art performance on various vision-language tasks, such as zero-shot visual question answering (VQA), image captioning, and image-text retrieval, with significantly fewer trainable parameters. For example, BLIP-2 outperforms Flamingo80B by 8.7% on zero-shot VQAv2 while using 54 times fewer trainable parameters.

### Main Contributions

1. **Effective Utilization of Frozen Unimodal Models**: BLIP-2 bridges the gap between the vision and language modalities through the two-stage pre-trained Q-Former, enabling the use of high-quality frozen image encoders and powerful frozen language models.
2. **Zero-Shot Image-to-Text Generation**: BLIP-2 can generate text from images by following natural language instructions, demonstrating strong capabilities in visual knowledge reasoning, visual dialogue, and more.
3. **High Computational Efficiency**: By using frozen unimodal models and a lightweight Q-Former, BLIP-2 trains far fewer parameters than existing methods. For instance, BLIP-2 outperforms Flamingo80B on zero-shot VQAv2 while using 54 times fewer trainable parameters.

In summary, BLIP-2's pre-training strategy improves performance on vision-language tasks while significantly reducing computational cost, making VLP research more efficient and practical.
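The efficiency argument above rests on which modules receive gradient updates in each stage. The following is a minimal sketch of that bookkeeping, not the authors' implementation: the `Module` class, the `trainable_params` helper, and every parameter count are hypothetical placeholders chosen for illustration and are not figures from the paper.

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    n_params: int    # hypothetical parameter count, illustrative only
    trainable: bool  # False means the module is frozen


def trainable_params(modules):
    """Sum the parameters of the modules that receive gradient updates."""
    return sum(m.n_params for m in modules if m.trainable)


# Placeholder sizes for illustration; these are NOT figures from the paper.
image_encoder = Module("image_encoder", 1_000_000_000, trainable=False)  # frozen
q_former = Module("q_former", 100_000_000, trainable=True)  # lightweight bridge
llm = Module("llm", 3_000_000_000, trainable=False)  # frozen

# Stage 1: representation learning against the frozen image encoder.
stage1 = [image_encoder, q_former]
# Stage 2: generative learning with the frozen language model attached.
stage2 = [image_encoder, q_former, llm]

for label, modules in (("stage 1", stage1), ("stage 2", stage2)):
    total = sum(m.n_params for m in modules)
    trained = trainable_params(modules)
    print(f"{label}: {trained:,} of {total:,} parameters trained ({trained / total:.1%})")

# End-to-end training of the same stack would update every parameter.
end_to_end = sum(m.n_params for m in stage2)
reduction = end_to_end / trainable_params(stage2)
print(f"trainable-parameter reduction vs end-to-end: {reduction:.0f}x")
```

In both stages only the Q-Former contributes to the trainable count, so attaching a larger frozen language model in stage 2 leaves the number of updated parameters unchanged.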

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

RLIPv2: Fast Scaling of Relational Language-Image Pre-training

mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs

MedBLIP: Bootstrapping Language-Image Pre-training from 3D Medical Images and Texts

FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training

Image As a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks

Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens

Probabilistic Language-Image Pre-Training

Leveraging per Image-Token Consistency for Vision-Language Pre-training

DLIP: Distilling Language-Image Pre-training

VL-BEiT: Generative Vision-Language Pretraining

NEVLP: Noise-Robust Framework for Efficient Vision-Language Pre-training

Frozen CLIP Models are Efficient Video Learners

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections

X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks

RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training

Generalizing Multimodal Pre-training into Multilingual via Language Acquisition

Frozen Transformers in Language Models Are Effective Visual Encoder Layers

RegionBLIP: A Unified Multi-modal Pre-training Framework for Holistic and Regional Comprehension