Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent

Yangning Li, Yinghui Li, Xingyu Wang, Yong Jiang, Zhen Zhang, Xinran Zheng, Hui Wang, Hai-Tao Zheng, Philip S. Yu, Fei Huang, Jingren Zhou
2024-11-05
Abstract: Multimodal Retrieval Augmented Generation (mRAG) plays an important role in mitigating the "hallucination" issue inherent in multimodal large language models (MLLMs). Although promising, existing heuristic mRAGs typically predefine fixed retrieval processes, which causes two issues: (1) Non-adaptive Retrieval Queries. (2) Overloaded Retrieval Queries. However, these flaws cannot be adequately reflected by current knowledge-seeking visual question answering (VQA) datasets, since most of the required knowledge can be readily obtained with a standard two-step retrieval. To bridge the dataset gap, we first construct the Dyn-VQA dataset, consisting of three types of "dynamic" questions, which require complex knowledge retrieval strategies variable in query, tool, and time: (1) Questions with rapidly changing answers. (2) Questions requiring multi-modal knowledge. (3) Multi-hop questions. Experiments on Dyn-VQA reveal that existing heuristic mRAGs struggle to provide sufficient and precisely relevant knowledge for dynamic questions due to their rigid retrieval processes. Hence, we further propose the first self-adaptive planning agent for multimodal retrieval, OmniSearch. The underlying idea is to emulate human problem-solving behavior, dynamically decomposing complex multimodal questions into sub-question chains with retrieval actions. Extensive experiments demonstrate the effectiveness of OmniSearch and provide direction for advancing mRAG. The code and dataset will be open-sourced at https://github.com/Alibaba-NLP/OmniSearch.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the shortcomings of existing multimodal Retrieval-Augmented Generation (mRAG) methods on complex, dynamic questions. Current mRAG methods usually predefine a fixed retrieval process, which leads to two main problems:

1. **Non-adaptive Retrieval Queries**: A rigid retrieval strategy cannot be adjusted to the context of the question or to intermediate results, so the model cannot retrieve again to deepen, verify, or rethink its understanding of the question.
2. **Overloaded Retrieval Queries**: A one-shot retrieval strategy places too much burden on a single query. The retrieved knowledge may look relevant but miss what actually matters, and so fails to give sufficient and precise support for answering the question.

To better evaluate how existing mRAG methods handle complex dynamic questions, the authors construct a new dataset, Dyn-VQA. It contains three types of questions that require complex knowledge retrieval strategies:

- **Questions with rapidly changing answers**: The answers are updated frequently, so additional retrieval steps must be planned flexibly to keep the information timely and accurate.
- **Questions requiring multi-modal knowledge**: These questions draw on knowledge in more than one modality, so an mRAG method must be able to retrieve across modalities.
- **Multi-hop questions**: These questions need several reasoning steps to answer, so an mRAG method must be able to perform several retrieval steps.

In addition, the authors propose an adaptive retrieval agent, OmniSearch, which aims to emulate human problem-solving behavior: it dynamically decomposes a complex multimodal question into a chain of sub-questions and adjusts each subsequent retrieval action in real time according to what has been retrieved so far. In this way, OmniSearch can handle dynamic questions more flexibly and provide more accurate and relevant information.

In summary, the main contributions of this paper are:

- Revealing that existing VQA benchmark datasets fail to reflect the dynamic knowledge retrieval that real-world questions require, and proposing the Dyn-VQA dataset.
- Evaluating a range of mRAG methods on Dyn-VQA and demonstrating their deficiencies on dynamic questions.
- Proposing OmniSearch, an adaptive retrieval agent that plans retrieval actions in real time.
- Showing experimentally that OmniSearch is effective and outlining directions for future mRAG research.

Together, these contributions help advance multimodal Retrieval-Augmented Generation, especially its ability to handle complex dynamic questions.
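
The contrast between a fixed retrieval process and an adaptive planning loop can be illustrated with a small Python sketch. This is not the OmniSearch implementation. Every name in it (`Step`, `fixed_two_step`, `adaptive_search`, `toy_planner`, the tool names, and the toy data) is hypothetical, the "search tools" are exact-match dictionary lookups, and the planner is a hand-written rule set standing in for an MLLM. The two-step baseline (image search, then a single text search) is likewise an assumed reading of the "standard two-step retrieval" the abstract mentions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# A tool maps a query string to one retrieved snippet (toy stand-in for a search API).
Tool = Callable[[str], str]


@dataclass
class Step:
    """One planner decision: a retrieval action, or a final answer when tool is None."""
    sub_question: str
    tool: Optional[str] = None
    query: Optional[str] = None
    answer: Optional[str] = None


History = List[Tuple[Step, str]]  # (step taken, evidence it returned)
Planner = Callable[[str, str, History], Step]


def fixed_two_step(question: str, image_id: str, tools: Dict[str, Tool]) -> List[str]:
    """Assumed heuristic baseline: image search, then ONE text search carrying the whole question."""
    entity = tools["image_search"](image_id)
    passage = tools["text_search"](f"{entity} {question}")  # single, overloaded query
    return [entity, passage]


def adaptive_search(question: str, image_id: str, planner: Planner,
                    tools: Dict[str, Tool], max_steps: int = 5) -> Tuple[Optional[str], History]:
    """Ask the planner for the next action after every retrieval, until it answers."""
    history: History = []
    for _ in range(max_steps):
        step = planner(question, image_id, history)
        if step.tool is None:              # planner decided it has enough evidence
            return step.answer, history
        evidence = tools[step.tool](step.query or "")
        history.append((step, evidence))   # feed the result back into planning
    return None, history                   # step budget exhausted without an answer


def toy_planner(question: str, image_id: str, history: History) -> Step:
    """Hand-written rules for one toy multi-hop question; a real agent would use an MLLM."""
    if len(history) == 0:
        return Step("Who is shown in the image?", "image_search", image_id)
    if len(history) == 1:
        player = history[0][1]
        return Step(f"Which team does {player} play for?", "text_search",
                    f"{player} current team")
    if len(history) == 2:
        team = history[1][1]
        return Step(f"Who coaches {team}?", "text_search",
                    f"{team} current head coach")
    return Step("final answer", answer=history[-1][1])


# Toy data with placeholder names; nothing here comes from Dyn-VQA.
IMAGE_INDEX = {"img_001": "Player A"}
TEXT_INDEX = {
    "Player A current team": "Team B",
    "Team B current head coach": "Coach C",
}
TOOLS: Dict[str, Tool] = {
    "image_search": lambda image_id: IMAGE_INDEX.get(image_id, "unknown"),
    "text_search": lambda query: TEXT_INDEX.get(query, "no result"),
}
QUESTION = "Who is the current head coach of this player's team?"

if __name__ == "__main__":
    print("fixed:", fixed_two_step(QUESTION, "img_001", TOOLS))
    answer, trace = adaptive_search(QUESTION, "img_001", toy_planner, TOOLS)
    print("adaptive:", answer, "after", len(trace), "retrieval steps")
```

In this toy setting the fixed pipeline's single overloaded query misses the index, while the adaptive loop reaches the answer through three short sub-queries. The exact-match lookup exaggerates the gap compared with a real search engine, so the sketch only illustrates the control flow: each retrieval result is fed back to the planner before the next query is chosen.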