Abstract: As Large Language Models (LLMs) become popular, an important trend has emerged of using multimodality to augment LLMs' generation ability, enabling them to better interact with the world. However, there is no unified view of at which stage and how to incorporate different modalities. In this survey, we review methods that assist and augment generative models by retrieving multimodal knowledge, in formats ranging from images, code, tables, and graphs to audio. Such methods offer a promising solution to important concerns such as factuality, reasoning, interpretability, and robustness. By providing an in-depth review, this survey aims to give scholars a deeper understanding of the methods' applications and encourage them to adapt existing techniques to the fast-growing field of LLMs.

What problem does this paper attempt to address?

The paper explores how multimodal information retrieval can enhance generative models and reviews recent advances in this area. It focuses on the following aspects:

1. **Multimodal Information Retrieval**: Current large language models (LLMs) have limitations in generation, such as a tendency to hallucinate, difficulty with arithmetic tasks, and a lack of interpretability. To overcome these issues, researchers are exploring information from various modalities, such as images, code, tables, graphs, and audio, to enhance generative models.
2. **Methods to Enhance Generative Models**: The paper reviews methods that improve the accuracy, reasoning ability, and robustness of generative models by retrieving knowledge from different modalities. These methods help address factuality and reasoning issues and also improve the interpretability of the generated content.
3. **Retrieval-Augmented Generation (RAG) with Multimodal Data**: For each modality (text, images, code, structured knowledge, audio, and video), the paper gives a detailed overview of related research and analyzes the specific applications and technical challenges.
4. **Future Directions**: The paper points out potential future directions for multimodal retrieval-augmented generation, including designing more efficient retrieval systems and better integrating multimodal information.

In summary, the paper aims to give scholars a comprehensive framework for understanding and applying existing techniques in the rapidly evolving field of LLMs.

Retrieving Multimodal Information for Augmented Generation: A Survey

LLMs Meet Multimodal Generation and Editing: A Survey

Multimodal Large Language Models: A Survey

A Survey on Multimodal Benchmarks: In the Era of Large AI Models

A Survey of Multimodal Large Language Model from A Data-centric Perspective

A Survey on Multimodal Large Language Models

Retrieval-Augmented Generation for Large Language Models: A Survey

Multimodal Image Synthesis and Editing: The Generative AI Era

A Comprehensive Survey and Guide to Multimodal Large Language Models in Vision-Language Tasks

A Comprehensive Survey of Multimodal Large Language Models: Concept, Application and Safety

How to Bridge the Gap between Modalities: A Comprehensive Survey on Multimodal Large Language Model

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

A Survey of Multimodal Composite Editing and Retrieval

Efficient Multimodal Large Language Models: A Survey

Personalized Multimodal Large Language Models: A Survey

Advanced Embedding Techniques in Multimodal Retrieval Augmented Generation A Comprehensive Study on Cross Modal AI Applications

A Survey on Benchmarks of Multimodal Large Language Models

Generative Multi-Modal Knowledge Retrieval with Large Language Models

Retrieval-Augmented Multimodal Language Modeling

Large Multimodal Agents: A Survey