Abstract: Large Vision-Language Models (LVLMs) have become pivotal at the intersection of computer vision and natural language processing. However, the full potential of LVLMs' Retrieval-Augmented Generation (RAG) capabilities remains underutilized. Existing works either focus solely on the text modality or are limited to specific tasks. Moreover, most LVLMs struggle to selectively utilize retrieved information and are sensitive to irrelevant or misleading references. To address these challenges, we propose a self-refinement framework designed to teach LVLMs to Selectively Utilize Retrieved Information (SURf). Specifically, when given questions that are incorrectly answered by the LVLM backbone, we obtain references that help correct the answers (positive references) and those that do not (negative references). We then fine-tune the LVLM backbone using a combination of these positive and negative references. Our experiments across three tasks and seven datasets demonstrate that our framework significantly enhances LVLMs' ability to effectively utilize retrieved multimodal references and improves their robustness against irrelevant or misleading information. The source code is available at https://github.com/GasolSun36/SURf.

What problem does this paper attempt to address?

The main problem this paper addresses is how to make large vision-language models (LVLMs) use retrieved information more effectively, and in particular how to make them robust to irrelevant or misleading references. Specifically, existing approaches have the following deficiencies:

1. **Single-modality focus**: Many existing multimodal retrieval-augmented generation (RAG) works focus only on the text modality and fail to fully utilize the LVLMs' understanding of visual content.
2. **Task limitations**: The few works that do combine multimodal references are often limited to specific tasks, such as image caption generation, and ignore the broad application potential of RAG in other tasks.
3. **Sensitivity issues**: Current LVLMs have difficulty using retrieved information selectively and are very sensitive to irrelevant or misleading references, which degrades performance.

To solve these problems, the authors propose SURf (Selectively Utilize Retrieved Information), a self-refinement framework that teaches LVLMs to use retrieved information selectively, thereby improving their performance and robustness across tasks.

### Specific methods

1. **Wrong answer screening**: First, collect the questions that the LVLM initially answers incorrectly.
2. **Positive and negative sample construction**: For each such question, retrieve the top-N images most similar to the test image, together with their descriptions, and have the model re-attempt the question with these references. An external evaluation tool then judges whether the answer has improved: references that help correct the answer become positive samples, and those that do not become negative samples.
3. **Instruction tuning**: Fine-tune on high-quality positive and negative sample pairs, so that the LVLM learns to draw on helpful retrieved information and to ignore irrelevant or misleading content.

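
The screening and sample-construction steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Reference`, `Sample`, `build_samples`, `exact_match`, and `toy_answer_with_ref` are hypothetical names, the toy answering function stands in for a real LVLM call, and exact-match scoring stands in for the external evaluation tool, which the summary does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Reference:
    """A retrieved image (by id) and its description. Hypothetical structure."""
    image_id: str
    caption: str


@dataclass
class Sample:
    """One training example: a question paired with a labeled reference."""
    question: str
    gold: str
    reference: Reference
    label: str  # "positive" or "negative"


def exact_match(pred: str, gold: str) -> float:
    """Assumed stand-in for the external evaluation tool."""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0


def build_samples(
    question: str,
    gold: str,
    initial_answer: str,
    references: List[Reference],
    answer_with_ref: Callable[[str, Reference], str],
    score: Callable[[str, str], float] = exact_match,
) -> List[Sample]:
    """Label each retrieved reference by whether it improves the answer."""
    base = score(initial_answer, gold)
    # Step 1 (screening): only initially wrong answers are kept.
    if base >= 1.0:
        return []
    samples = []
    for ref in references:
        # Step 2: re-attempt the question with this reference in context.
        new_answer = answer_with_ref(question, ref)
        label = "positive" if score(new_answer, gold) > base else "negative"
        samples.append(Sample(question, gold, ref, label))
    return samples


def toy_answer_with_ref(question: str, ref: Reference) -> str:
    """Toy stand-in for the LVLM: it simply follows the reference caption."""
    return "dog" if "dog" in ref.caption else "cat"


refs = [
    Reference("img1", "a dog running on grass"),
    Reference("img2", "a cat sleeping on a sofa"),
]
samples = build_samples(
    "What animal is shown?", "dog", "cat", refs, toy_answer_with_ref
)
labels = [s.label for s in samples]
print(labels)
```

In this toy run the first reference corrects the answer and is labeled positive, while the second leaves it wrong and is labeled negative. The resulting labeled pairs are the kind of data that the instruction-tuning step would consume.
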
### Experimental results

The authors ran extensive experiments across three tasks (VQA, image caption generation, image classification) and seven datasets. The results show that:

- SURf significantly improves how effectively LVLMs use retrieved multimodal references.
- The model's robustness to irrelevant or misleading information improves significantly.
- Compared with the baseline methods, SURf performs better on multiple tasks and is notably more stable when irrelevant image-caption pairs are introduced.

### Summary

This research shows how a self-refinement framework can improve LVLMs' ability to use retrieved information selectively, and it also reveals how much current LVLMs struggle with irrelevant information. With this framework, LVLMs can handle complex real-world situations more effectively, with better performance and reliability.

SURf: Teaching Large Vision-Language Models to Selectively Utilize Retrieved Information

SearchLVLMs: A Plug-and-Play Framework for Augmenting Large Vision-Language Models by Searching Up-to-Date Internet Knowledge

Unsupervised Information Refinement Training of Large Language Models for Retrieval-Augmented Generation

RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing

Fine-Grained Guidance for Retrievers: Leveraging LLMs' Feedback in Retrieval-Augmented Generation

Rethinking Overlooked Aspects in Vision-Language Models

Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines

ViLReF: An Expert Knowledge Enabled Vision-Language Retinal Foundation Model

SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains

From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks

Calibrated Self-Rewarding Vision Language Models

Vision-Language Models for Vision Tasks: A Survey

RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data

InsightSee: Advancing Multi-agent Vision-Language Models for Enhanced Visual Understanding

Exploring the Distinctiveness and Fidelity of the Descriptions Generated by Large Vision-Language Models

LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

Bridging the Preference Gap between Retrievers and LLMs