Abstract: Multimodal Retrieval Augmented Generation (mRAG) plays an important role in mitigating the "hallucination" issue inherent in multimodal large language models (MLLMs). Although promising, existing heuristic mRAGs typically predefine fixed retrieval processes, which causes two issues: (1) non-adaptive retrieval queries and (2) overloaded retrieval queries. However, these flaws cannot be adequately reflected by current knowledge-seeking visual question answering (VQA) datasets, since most of the required knowledge can be readily obtained with a standard two-step retrieval. To bridge the dataset gap, we first construct the Dyn-VQA dataset, consisting of three types of "dynamic" questions, which require complex knowledge retrieval strategies variable in query, tool, and time: (1) questions with rapidly changing answers, (2) questions requiring multi-modal knowledge, and (3) multi-hop questions. Experiments on Dyn-VQA reveal that existing heuristic mRAGs struggle to provide sufficient and precisely relevant knowledge for dynamic questions due to their rigid retrieval processes. Hence, we further propose the first self-adaptive planning agent for multimodal retrieval, OmniSearch. The underlying idea is to emulate human problem-solving behavior by dynamically decomposing complex multimodal questions into sub-question chains with retrieval actions. Extensive experiments demonstrate the effectiveness of OmniSearch and also provide direction for advancing mRAG. The code and dataset will be open-sourced at https://github.com/Alibaba-NLP/OmniSearch.

What problem does this paper attempt to address?

This paper addresses the shortcomings of existing multimodal Retrieval-Augmented Generation (mRAG) methods on complex, dynamic questions. Current mRAG methods usually predefine a fixed retrieval process, which leads to two main problems:

1. **Non-adaptive Retrieval Queries**: A rigid retrieval strategy cannot be adjusted according to the context of the question or to intermediate results, so the model cannot retrieve again to deepen, verify, or rethink its understanding of the question.
2. **Overloaded Retrieval Queries**: A one-shot retrieval strategy places too much burden on a single query. The retrieved knowledge may look relevant yet be unimportant, so it fails to give sufficient and precise support for answering the question.

To better evaluate how existing mRAG methods handle complex dynamic questions, the authors construct a new dataset, Dyn-VQA. It contains three types of questions that require complex knowledge retrieval strategies:

- **Questions with rapidly changing answers**: The answers are updated frequently, so additional retrieval steps must be planned flexibly to keep the information timely and accurate.
- **Questions requiring multi-modal knowledge**: These questions involve knowledge from several modalities, so an mRAG method must be able to retrieve across modalities.
- **Multi-hop questions**: These questions need several reasoning steps, so an mRAG method must be able to perform several retrieval steps.
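To make the fixed retrieval process criticized above more concrete, here is a minimal Python sketch of a two-step pipeline. It is an illustration only: the function name, the two tool signatures, and the choice of "image lookup, then one text query" as the two steps are assumptions, not details taken from the paper.

```python
from typing import Callable, List

# Hypothetical tool signatures; none of these names come from the paper.
ImageSearch = Callable[[str], str]        # image path -> text describing the image
TextSearch = Callable[[str], List[str]]   # query -> retrieved passages


def fixed_two_step_retrieval(
    question: str,
    image_path: str,
    image_search: ImageSearch,
    text_search: TextSearch,
) -> List[str]:
    """Run the same two retrieval steps for every question, with no feedback."""
    # Step 1: look up the image to get a description of what it shows.
    description = image_search(image_path)
    # Step 2: issue a single text query that carries the whole question at once.
    query = f"{description} {question}"
    # Nothing inspects the passages, so there is no chance to re-query.
    return text_search(query)
```

Both problems are visible in this sketch: the sequence of steps never changes in response to what was retrieved (non-adaptive), and the single text query has to carry the entire question (overloaded).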
In addition, the authors propose an adaptive retrieval agent, OmniSearch. It aims to emulate human problem-solving behavior: it dynamically decomposes a complex multimodal question into a chain of sub-questions and adjusts each subsequent retrieval action in real time according to what has been retrieved so far. In this way, OmniSearch can handle dynamic questions more flexibly and supply more accurate and relevant knowledge.

In summary, the main contributions of this paper are:

- Revealing that existing VQA benchmark datasets fail to reflect the dynamic knowledge retrieval that real-world questions require, and proposing the Dyn-VQA dataset.
- Evaluating various mRAG methods on Dyn-VQA and demonstrating their deficiencies on dynamic questions.
- Proposing OmniSearch, an adaptive retrieval agent capable of planning retrieval actions in real time.
- Presenting experimental results that demonstrate the effectiveness of OmniSearch and point to directions for future mRAG research.

This work helps advance multimodal Retrieval-Augmented Generation, especially its ability to handle complex dynamic questions.
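The adaptive behavior described above, where a question is decomposed into sub-questions and the next retrieval action depends on earlier results, can be sketched as a simple plan-retrieve loop. This is a hedged illustration, not the OmniSearch implementation: `Step`, `adaptive_retrieve`, the `planner` callable (which in practice would presumably wrap an MLLM), and the tool names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One planner decision. All field names are hypothetical."""
    sub_question: str
    tool: Optional[str]                 # name of a retrieval tool, or None to stop
    query: str = ""
    final_answer: Optional[str] = None  # set only when tool is None


History = List[Dict[str, str]]
Planner = Callable[[str, History], Step]  # assumed to wrap a model call
Tool = Callable[[str], str]               # query -> retrieved evidence


def adaptive_retrieve(
    question: str,
    planner: Planner,
    tools: Dict[str, Tool],
    max_steps: int = 5,
) -> Tuple[Optional[str], History]:
    """Alternate planning and retrieval until the planner decides to answer."""
    history: History = []
    for _ in range(max_steps):
        # The planner sees everything retrieved so far and picks the next action.
        step = planner(question, history)
        if step.tool is None:
            return step.final_answer, history
        evidence = tools[step.tool](step.query)
        history.append({
            "sub_question": step.sub_question,
            "tool": step.tool,
            "query": step.query,
            "evidence": evidence,
        })
    # Step budget exhausted without a final answer.
    return None, history
```

Because the planner receives the full history on every iteration, it can choose a different tool, rewrite the query, or stop early, which is the flexibility a fixed pipeline lacks. The `max_steps` cap is an added safeguard in this sketch against a planner that never terminates.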

Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent
