PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter

Junfei Xiao, Zheng Xu, Alan Yuille, Shen Yan, Boyu Wang

2024-06-01

Abstract: This paper demonstrates that a progressively aligned language model can effectively bridge frozen vision encoders and large language models (LLMs). While the fundamental architecture and pre-training methods of vision encoders and LLMs have been extensively studied, the architecture and training strategy of vision-language adapters vary significantly across recent works. Our research undertakes a thorough exploration of the state-of-the-art perceiver resampler architecture and builds a strong baseline. However, we observe that the vision-language alignment with perceiver resampler exhibits slow convergence and limited scalability with a lack of direct supervision. To address this issue, we propose PaLM2-VAdapter, employing a progressively aligned language model as the vision-language adapter. Compared to the strong baseline with perceiver resampler, our method empirically shows faster convergence, higher performance, and stronger scalability. Extensive experiments across various Visual Question Answering (VQA) and captioning tasks on both images and videos demonstrate that our model exhibits state-of-the-art visual understanding and multi-modal reasoning capabilities. Notably, our method achieves these advancements with 30~70% fewer parameters than the state-of-the-art large vision-language models, marking a significant efficiency improvement.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses how to connect a frozen vision encoder to a large language model (LLM) effectively through a strong vision-language adapter. Although the architectures and pre-training methods of vision encoders and LLMs have been widely studied, the architecture and training strategy of vision-language adapters vary significantly across recent works. The authors therefore propose PaLM2-VAdapter, which uses a progressively aligned language model as the vision-language adapter. The goal is faster convergence, higher performance, and stronger scalability with fewer parameters, and thereby more efficient visual understanding and multimodal reasoning.

Specifically, the challenges mentioned in the paper include:

1. **Limitations of existing adapters**: Existing vision-language adapters such as the Perceiver Resampler are effective, but they converge slowly and scale poorly with large-scale vision encoders, which the paper attributes to a lack of direct supervision.
2. **Parameter efficiency**: How to reduce the model's parameter count and improve computational efficiency while maintaining high performance.
3. **Performance improvement in multimodal tasks**: How to achieve better performance on visual question answering (VQA) and captioning tasks for both images and videos.

To address these challenges, PaLM2-VAdapter adopts a progressive alignment strategy and uses a small PaLM2 language model as the adapter, trained in two stages. Compared with the Perceiver Resampler baseline, this yields faster convergence, higher performance, and stronger scalability. Experimental results show that the method reaches state-of-the-art results on multiple vision-language benchmarks while using 30% to 70% fewer parameters than state-of-the-art large vision-language models.
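To make the Perceiver Resampler baseline mentioned above more concrete, here is a minimal sketch of the general idea behind such an adapter: a fixed set of learnable latent queries cross-attends to the vision encoder's patch features and returns a fixed number of tokens for the LLM. This is an illustrative assumption, not the paper's implementation: it uses a single attention head, one layer, random weights, and pure Python lists, and every name in it (`resample`, `latents`, `w_q`, `w_k`, `w_v`) is hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap the rows and columns of a matrix."""
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def resample(visual_feats, latents, w_q, w_k, w_v):
    """Single-head cross-attention: latent queries attend to visual features.

    The output has one row per latent, regardless of how many
    visual feature rows (patches) come in.
    """
    q = matmul(latents, w_q)       # (num_latents, dim)
    k = matmul(visual_feats, w_k)  # (num_patches, dim)
    v = matmul(visual_feats, w_v)  # (num_patches, dim)
    scale = math.sqrt(len(q[0]))
    scores = [[s / scale for s in row] for row in matmul(q, transpose(k))]
    return matmul(softmax_rows(scores), v)  # (num_latents, dim)


def random_matrix(rows, cols, rng):
    """Stand-in for learned weights or encoder outputs (assumption)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, num_latents = 8, 4
latents = random_matrix(num_latents, dim, rng)
w_q, w_k, w_v = (random_matrix(dim, dim, rng) for _ in range(3))

# Different numbers of visual patches map to the same number of output tokens.
for num_patches in (16, 64):
    feats = random_matrix(num_patches, dim, rng)
    tokens = resample(feats, latents, w_q, w_k, w_v)
    print(num_patches, "->", len(tokens), "x", len(tokens[0]))
```

In this sketch the number of output tokens depends only on the number of latents, which is what lets such an adapter hand the LLM a fixed-length visual prefix. As summarized above, PaLM2-VAdapter replaces this kind of module with a small, progressively aligned language model; that design is not reproduced here.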

ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

A-VL: Adaptive Attention for Large Vision-Language Models

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

Large Language Models Are Strong Audio-Visual Speech Recognition Learners

Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting

p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models

Cross-Modal Adapter: Parameter-Efficient Transfer Learning Approach for Vision-Language Models

eP-ALM: Efficient Perceptual Augmentation of Language Models

Bridging Vision and Language Spaces with Assignment Prediction

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

CLIP-Adapter: Better Vision-Language Models with Feature Adapters

APoLLo: Unified Adapter and Prompt Learning for Vision Language Models

Enhancing Model Performance: Another Approach to Vision-Language Instruction Tuning

BRAVE: Broadening the visual encoding of vision-language models

PaLM-E: An Embodied Multimodal Language Model

VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks

Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models

EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning