SURf: Teaching Large Vision-Language Models to Selectively Utilize Retrieved Information

Jiashuo Sun, Jihai Zhang, Yucheng Zhou, Zhaochen Su, Xiaoye Qu, Yu Cheng
2024-09-21
Abstract: Large Vision-Language Models (LVLMs) have become pivotal at the intersection of computer vision and natural language processing. However, the full potential of LVLMs' Retrieval-Augmented Generation (RAG) capabilities remains underutilized. Existing works either focus solely on the text modality or are limited to specific tasks. Moreover, most LVLMs struggle to selectively utilize retrieved information and are sensitive to irrelevant or misleading references. To address these challenges, we propose a self-refinement framework designed to teach LVLMs to Selectively Utilize Retrieved Information (SURf). Specifically, when given questions that are incorrectly answered by the LVLM backbone, we obtain references that help correct the answers (positive references) and those that do not (negative references). We then fine-tune the LVLM backbone using a combination of these positive and negative references. Our experiments across three tasks and seven datasets demonstrate that our framework significantly enhances LVLMs' ability to effectively utilize retrieved multimodal references and improves their robustness against irrelevant or misleading information. The source code is available at https://github.com/GasolSun36/SURf.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how to make large vision-language models (LVLMs) use retrieved information more effectively, and in particular how to keep them robust when the retrieved references are irrelevant or misleading. It identifies three shortcomings in existing work:

1. **Text-only focus**: Many existing multimodal retrieval-augmented generation (RAG) works focus only on the text modality and do not fully exploit the LVLMs' understanding of visual content.
2. **Task limitations**: The few works that do incorporate multimodal references are often limited to specific tasks, such as image captioning, and overlook the broader potential of RAG in other tasks.
3. **Sensitivity to noisy references**: Current LVLMs have difficulty using retrieved information selectively and are highly sensitive to irrelevant or misleading references, which degrades performance.

To address these problems, the authors propose SURf (Selectively Utilize Retrieved Information), a self-refinement framework that teaches LVLMs to use retrieved information selectively, improving their performance and robustness across tasks.

### Specific methods

1. **Wrong-answer screening**: Collect the questions that the LVLM initially answers incorrectly.
2. **Positive and negative sample construction**: For each such question, retrieve the top-N images most similar to the test image, together with their descriptions, and have the model answer again with these references. An external evaluation tool then determines whether the answer has improved: references that help become positive samples, and those that do not become negative samples.
3. **Instruction tuning**: Fine-tune the LVLM on high-quality positive and negative sample pairs so that it learns to draw on helpful retrieved information and ignore irrelevant or misleading content.

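
The screening and sample-construction steps above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: `Example`, `Reference`, `build_reference_pools`, and the `answer`, `retrieve`, and `score` callables are hypothetical names, and a toy exact-match scorer stands in for the external evaluation tool mentioned in the summary.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Reference:
    """A retrieved image and its description (hypothetical structure)."""
    image_id: str
    caption: str


@dataclass(frozen=True)
class Example:
    """A question about an image with a known gold answer (hypothetical structure)."""
    image_id: str
    question: str
    gold_answer: str


# Assumed interfaces: the real model, retriever, and evaluator are not specified here.
AnswerFn = Callable[[Example, List[Reference]], str]  # model call; reference list may be empty
RetrieveFn = Callable[[Example, int], List[Reference]]  # top-N similar images with captions
ScoreFn = Callable[[str, str], float]  # (prediction, gold) -> score, higher is better


def build_reference_pools(
    examples: List[Example],
    answer: AnswerFn,
    retrieve: RetrieveFn,
    score: ScoreFn,
    top_n: int = 5,
    correct_threshold: float = 1.0,
) -> List[Tuple[Example, List[Reference], List[Reference]]]:
    """Return (example, positive_refs, negative_refs) for initially wrong answers."""
    pools = []
    for ex in examples:
        base_score = score(answer(ex, []), ex.gold_answer)
        if base_score >= correct_threshold:
            continue  # step 1: keep only questions the model answers incorrectly
        positives, negatives = [], []
        for ref in retrieve(ex, top_n):
            # Step 2: answer again with one reference and check for improvement.
            new_score = score(answer(ex, [ref]), ex.gold_answer)
            (positives if new_score > base_score else negatives).append(ref)
        pools.append((ex, positives, negatives))
    return pools


# Toy stand-ins so the sketch runs end to end.
def toy_answer(ex: Example, refs: List[Reference]) -> str:
    # Pretend model: echoes the last word of the first caption, else "unknown".
    return refs[0].caption.split()[-1] if refs else "unknown"


def toy_retrieve(ex: Example, n: int) -> List[Reference]:
    refs = [Reference("r1", "a photo of a cat"), Reference("r2", "a photo of a dog")]
    return refs[:n]


def exact_match(pred: str, gold: str) -> float:
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0


if __name__ == "__main__":
    data = [Example("q1", "What animal is shown?", "cat")]
    for ex, pos, neg in build_reference_pools(data, toy_answer, toy_retrieve, exact_match, top_n=2):
        print(ex.question, [r.image_id for r in pos], [r.image_id for r in neg])
```

The resulting positive and negative pools would then feed the instruction-tuning step. How the pairs are filtered for quality and formatted into training prompts is not described in this summary, so the sketch leaves that part out.
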
### Experimental results

The authors ran experiments on three tasks (VQA, image captioning, and image classification) across seven datasets. The results show that:

- SURf significantly improves how effectively LVLMs use retrieved multimodal references.
- The model's robustness to irrelevant or misleading information improves significantly.
- Compared with the baseline methods, SURf performs better on multiple tasks and is notably more stable when irrelevant image-caption pairs are introduced.

### Summary

This research shows how a self-refinement framework can improve the ability of LVLMs to use retrieved information selectively, and it also exposes how much current LVLMs struggle with irrelevant information. With this framework, LVLMs can handle noisy retrieval in practical applications more effectively, improving both performance and reliability.