PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter

Junfei Xiao, Zheng Xu, Alan Yuille, Shen Yan, Boyu Wang
2024-06-01
Abstract: This paper demonstrates that a progressively aligned language model can effectively bridge frozen vision encoders and large language models (LLMs). While the fundamental architecture and pre-training methods of vision encoders and LLMs have been extensively studied, the architecture and training strategy of vision-language adapters vary significantly across recent works. Our research undertakes a thorough exploration of the state-of-the-art perceiver resampler architecture and builds a strong baseline. However, we observe that the vision-language alignment with perceiver resampler exhibits slow convergence and limited scalability with a lack of direct supervision. To address this issue, we propose PaLM2-VAdapter, employing a progressively aligned language model as the vision-language adapter. Compared to the strong baseline with perceiver resampler, our method empirically shows faster convergence, higher performance, and stronger scalability. Extensive experiments across various Visual Question Answering (VQA) and captioning tasks on both images and videos demonstrate that our model exhibits state-of-the-art visual understanding and multi-modal reasoning capabilities. Notably, our method achieves these advancements with 30~70% fewer parameters than the state-of-the-art large vision-language models, marking a significant efficiency improvement.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to effectively connect frozen vision encoders with large language models (LLMs), that is, how to build a strong vision-language adapter. Although the architectures and pre-training methods of vision encoders and LLMs have been widely studied, vision-language adapters differ significantly across recent works in both architecture and training strategy, with consequences for convergence speed, performance, and scalability. The authors therefore propose PaLM2-VAdapter, which uses a progressively aligned language model as the vision-language adapter. It aims to improve convergence speed, performance, and scalability while using fewer parameters, yielding more efficient visual understanding and multimodal reasoning.

Specifically, the challenges mentioned in the paper include:

1. **Limitations of existing adapters**: Existing vision-language adapters such as the Perceiver Resampler are effective, but they converge slowly and scale poorly with large vision encoders.
2. **Parameter efficiency**: How to reduce the model's parameter count, and thereby improve computational efficiency, while maintaining high performance.
3. **Performance improvement in multimodal tasks**: How to achieve better performance on visual question answering (VQA) and captioning tasks for both images and videos.

To address these challenges, PaLM2-VAdapter adopts a progressive alignment strategy and uses a small PaLM2 language model as the adapter. Trained in two stages, it converges faster, performs better, and scales more strongly than the Perceiver Resampler baseline. Experimental results show that the method reaches state-of-the-art results on multiple vision-language benchmarks with 30% to 70% fewer parameters than state-of-the-art large vision-language models.
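
For readers unfamiliar with the Perceiver Resampler baseline mentioned above, the following is a minimal sketch of the general idea, not the paper's implementation: a small set of learned latent queries cross-attends to the vision encoder's patch features, producing a fixed number of visual tokens to pass on to the LLM. Everything in it is an assumption made for illustration: the helper names (`matmul`, `softmax`, `random_matrix`, `resample`), the sizes, the random stand-in weights, and the simplified single-head, single-layer form without a feed-forward block.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Stand-in for learned parameters or encoder outputs (hypothetical values)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def resample(visual_feats, latents, w_q, w_k, w_v):
    """Single-head cross-attention: latent queries attend over visual features."""
    q = matmul(latents, w_q)               # (num_latents, dim)
    k = matmul(visual_feats, w_k)          # (num_patches, dim)
    v = matmul(visual_feats, w_v)          # (num_patches, dim)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]   # (dim, num_patches)
    scores = matmul(q, k_t)                # (num_latents, num_patches)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v), weights     # tokens: (num_latents, dim)


rng = random.Random(0)
num_patches, num_latents, dim = 16, 4, 8   # illustrative sizes only
visual_feats = random_matrix(num_patches, dim, rng)  # frozen encoder output (assumed)
latents = random_matrix(num_latents, dim, rng)       # learned queries (assumed)
w_q, w_k, w_v = (random_matrix(dim, dim, rng) for _ in range(3))

tokens, weights = resample(visual_feats, latents, w_q, w_k, w_v)
print(len(tokens), len(tokens[0]))
```

Whatever the number of input patches, the output always has `num_latents` rows, which is what lets an adapter of this kind hand a fixed-length visual prefix to the language model.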