Retrieving Multimodal Information for Augmented Generation: A Survey

Ruochen Zhao, Hailin Chen, Weishi Wang, Fangkai Jiao, Xuan Long Do, Chengwei Qin, Bosheng Ding, Xiaobao Guo, Minzhi Li, Xingxuan Li, Shafiq Joty
2023-12-01
Abstract: As Large Language Models (LLMs) become popular, an important trend has emerged of using multimodality to augment the LLMs' generation ability, which enables LLMs to better interact with the world. However, there is no unified understanding of at which stage and how different modalities should be incorporated. In this survey, we review methods that assist and augment generative models by retrieving multimodal knowledge, whose formats range from images, code, tables, and graphs to audio. Such methods offer a promising solution to important concerns such as factuality, reasoning, interpretability, and robustness. By providing an in-depth review, this survey is expected to give scholars a deeper understanding of the methods' applications and encourage them to adapt existing techniques to the fast-growing field of LLMs.
Computation and Language
What problem does this paper attempt to address?
The paper explores how retrieving multimodal information can enhance the capabilities of generative models, and it reviews recent advances in this area. Specifically, it focuses on the following aspects:

1. **Multimodal Information Retrieval**: Current large language models (LLMs) have limitations in their generative capabilities, such as a tendency to hallucinate, difficulty with arithmetic tasks, and a lack of interpretability. To overcome these issues, researchers are exploring the use of information from various modalities, such as images, code, tables, graphs, and audio, to enhance generative models.
2. **Methods to Enhance Generative Models**: The paper reviews methods that improve the accuracy, reasoning ability, and robustness of generative models by retrieving knowledge from different modalities. These methods help address factuality and reasoning issues and also improve the interpretability of the generated content.
3. **Retrieval-Augmented Generation (RAG) with Multimodal Data**: For modalities such as text, images, code, structured knowledge, audio, and video, the paper gives a detailed overview of related work and analyzes the specific applications and technical challenges of each modality.
4. **Future Directions**: The paper points out potential future directions for multimodal retrieval-augmented generation, including designing more efficient retrieval systems and better integrating multimodal information.

In summary, the paper aims to give scholars a comprehensive framework for understanding these methods so that they can better apply existing techniques in the rapidly evolving field of LLMs.