BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi
2023-06-15
Abstract: The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses the cost of vision-language pre-training (VLP). As models grow, end-to-end training of large models becomes increasingly expensive, making VLP research harder and harder to afford. In addition, most existing VLP methods cannot flexibly reuse existing unimodal pre-trained models, such as large language models (LLMs), during pre-training.

To solve these problems, the paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from a frozen image encoder and a frozen large language model. BLIP-2 bridges the modality gap with a lightweight Querying Transformer (Q-Former), which is pre-trained in two stages:

1. **Stage 1**: Bootstrap vision-language representation learning from the frozen image encoder.
2. **Stage 2**: Bootstrap vision-to-language generative learning from the frozen language model.

With this approach, BLIP-2 achieves state-of-the-art performance on various vision-language tasks, such as zero-shot visual question answering (VQA), image captioning, and image-text retrieval, with significantly fewer trainable parameters than existing methods. For example, BLIP-2 outperforms Flamingo80B by 8.7% on zero-shot VQAv2 while using 54 times fewer trainable parameters.

### Main Contributions

1. **Effective Utilization of Frozen Unimodal Models**: The two-stage pre-trained Q-Former bridges the gap between the vision and language modalities, which lets BLIP-2 use high-quality frozen image encoders and powerful frozen language models.
2. **Zero-Shot Image-to-Text Generation**: BLIP-2 can generate text about an image by following natural language instructions, demonstrating strong capabilities in visual knowledge reasoning, visual dialogue, and more.
3. **High Computational Efficiency**: By using frozen unimodal models and a lightweight Q-Former, BLIP-2 has a far lower computational cost than existing methods. For instance, BLIP-2 outperforms Flamingo80B on zero-shot VQAv2 while using 54 times fewer trainable parameters.

In summary, BLIP-2's pre-training strategy both improves performance on vision-language tasks and significantly reduces computational cost, making VLP research more efficient and practical.
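
The split between frozen and trainable components across the two stages can be illustrated with a small bookkeeping sketch. It is not the authors' implementation: the `Component` class, the helper functions, and all parameter counts are hypothetical placeholders chosen to make the arithmetic easy to follow, not figures from the paper. It only shows that when the image encoder and language model are frozen, the Q-Former is the sole contributor to the trainable parameter count in both stages.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    num_params: int  # hypothetical size, NOT a figure from the paper
    frozen: bool


def trainable_params(components):
    """Sum the parameters of every component that is not frozen."""
    return sum(c.num_params for c in components if not c.frozen)


def total_params(components):
    """Sum the parameters of all components, frozen or not."""
    return sum(c.num_params for c in components)


# Placeholder sizes (assumptions for illustration only).
image_encoder = Component("image_encoder", 1_000_000_000, frozen=True)
q_former = Component("q_former", 100_000_000, frozen=False)
language_model = Component("language_model", 3_000_000_000, frozen=True)

# Stage 1: representation learning with the frozen image encoder.
stage1 = [image_encoder, q_former]
# Stage 2: generative learning, adding the frozen language model.
stage2 = [image_encoder, q_former, language_model]

for label, stage in (("stage 1", stage1), ("stage 2", stage2)):
    share = trainable_params(stage) / total_params(stage)
    print(f"{label}: {trainable_params(stage):,} trainable parameters "
          f"({share:.1%} of {total_params(stage):,} total)")
```

In a real training framework the same idea would typically be expressed by disabling gradient updates for the frozen models; the sketch above only tracks the parameter counts.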