RAVEN: Multitask Retrieval Augmented Vision-Language Learning

Varun Nagaraj Rao, Siddharth Choudhary, Aditya Deshpande, Ravi Kumar Satzoda, Srikar Appalaraju
2024-06-27
Abstract: The scaling of large language models to encode all the world's knowledge in model parameters is unsustainable and has exacerbated resource barriers. Retrieval-Augmented Generation (RAG) presents a potential solution, yet its application to vision-language models (VLMs) is underexplored. Existing methods focus on models designed for single tasks. Furthermore, they are limited by the need for resource-intensive pre-training, additional parameter requirements, unaddressed modality prioritization, and a lack of clear benefit over non-retrieval baselines. This paper introduces RAVEN, a multitask retrieval-augmented VLM framework that enhances base VLMs through efficient, task-specific fine-tuning. By integrating retrieval-augmented samples without the need for additional retrieval-specific parameters, we show that the model acquires retrieval properties that are effective across multiple tasks. Our results and extensive ablations across retrieved modalities for the image captioning and VQA tasks indicate significant performance improvements compared to non-retrieved baselines: +1 CIDEr on MSCOCO, +4 CIDEr on NoCaps, and nearly +3% accuracy on specific VQA question types. This underscores the efficacy of applying RAG approaches to VLMs, marking a stride toward more efficient and accessible multimodal learning.
Computer Vision and Pattern Recognition, Artificial Intelligence, Information Retrieval
What problem does this paper attempt to address?
The paper addresses the unsustainability and resource barriers that come with scaling large language models (such as the GPT series) to hold ever more knowledge in their parameters. Specifically, it explores applying Retrieval-Augmented Generation (RAG) to Vision-Language Models (VLMs). Although RAG has been successful in natural language processing, its use in VLMs remains limited, and existing methods have the following shortcomings:

1. **Single-Task Limitation**: Most methods are designed for a single task, so cross-task performance is not evaluated.
2. **High Pre-training Requirements**: They rely on resource-intensive pre-training with retrieval-specific parameters, which increases model complexity and resource consumption.
3. **Unclear Modality Priority**: Whether the image or the text modality should take priority during retrieval is not clearly defined.
4. **Dataset Overlap**: Some studies retrieve from datasets that overlap with the pre-training or fine-tuning datasets, which undermines the validity of the results.

To address these issues, the paper proposes RAVEN, a multitask retrieval-augmented VLM framework that enhances a base VLM through efficient task-specific fine-tuning, without additional retrieval-specific parameters. Experimental results show that RAVEN improves over non-retrieved baselines on Image Captioning and Visual Question Answering (VQA): +1 CIDEr on MSCOCO, +4 CIDEr on NoCaps, and nearly 3% higher accuracy on specific types of VQA questions. These results demonstrate the effectiveness of applying RAG to VLMs and point toward more efficient and accessible multimodal learning.
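To make the idea of "integrating retrieval-augmented samples without additional retrieval-specific parameters" more concrete, here is a minimal Python sketch. It is not RAVEN's implementation: the toy embedding vectors, the in-memory `memory` list, the `context:` separator, and the helper names `retrieve_top_k` and `build_augmented_input` are all assumptions made for illustration. It only shows the general pattern of ranking an external memory by similarity and appending the retrieved captions to the model's text input, so that the base model itself needs no new parameters.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec, memory, k=2):
    """Return the captions of the k memory entries most similar to the query.

    `memory` is a list of (embedding, caption) pairs. This is a hypothetical
    stand-in for an external retrieval index, not the paper's retriever.
    """
    ranked = sorted(memory, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [caption for _, caption in ranked[:k]]


def build_augmented_input(task_prompt, retrieved_captions):
    """Append retrieved captions to the task prompt as plain text.

    The "context:" prefix is an assumed format; the point is that retrieval
    changes only the input, not the model's parameters.
    """
    context = " ".join(f"context: {c}" for c in retrieved_captions)
    return f"{task_prompt} {context}".strip()


# Toy external memory: hand-written 3-d "embeddings" paired with captions.
memory = [
    ([1.0, 0.0, 0.0], "a dog running on a beach"),
    ([0.9, 0.1, 0.0], "a puppy playing in the sand"),
    ([0.0, 1.0, 0.0], "a plate of pasta on a table"),
]

query = [1.0, 0.05, 0.0]  # assumed embedding of the query image
captions = retrieve_top_k(query, memory, k=2)
augmented = build_augmented_input("what does the image describe?", captions)
print(augmented)
```

In this sketch, the augmented string would be passed to the base VLM together with the query image during fine-tuning and inference. Swapping which modality is retrieved (captions, images, or both) changes only what `build_augmented_input` assembles, which is the kind of choice the paper's modality ablations examine.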