Abstract: With the recent advancement in large language models (LLMs), there is a growing interest in combining LLMs with multimodal learning. Previous surveys of multimodal large language models (MLLMs) mainly focus on multimodal understanding. This survey elaborates on multimodal generation and editing across various domains, comprising image, video, 3D, and audio. Specifically, we summarize the notable advancements with milestone works in these fields and categorize these studies into LLM-based and CLIP/T5-based methods. Then, we summarize the various roles of LLMs in multimodal generation and exhaustively investigate the critical technical components behind these methods and the multimodal datasets utilized in these studies. Additionally, we dig into tool-augmented multimodal agents that can leverage existing generative models for human-computer interaction. Lastly, we discuss the advancements in the generative AI safety field, investigate emerging applications, and discuss future prospects. Our work provides a systematic and insightful overview of multimodal generation and processing, which is expected to advance the development of Artificial Intelligence for Generative Content (AIGC) and world models. A curated list of all related papers can be found at https://github.com/YingqingHe/Awesome-LLMs-meet-Multimodal-Generation

What problem does this paper attempt to address?

This paper addresses how to use large language models (LLMs) to improve multimodal generation and editing. Specifically, it focuses on combining LLMs with non-text modalities such as images, video, 3D, and audio to achieve higher-quality content generation and editing. By reviewing existing technical progress, the paper examines the multiple roles LLMs play in generation across modalities (evaluator, labeler, instruction processor, planner, provider of semantic guidance, or backbone architecture) and discusses the key technical components behind these methods and the multimodal datasets they use. The paper also covers generative AI safety, emerging applications, and future directions, aiming to provide a systematic and in-depth overview for the development of multimodal generation and world models.

The main contributions of the paper include:

- Providing the first systematic review covering the applications of LLMs in multimodal generation and editing, including images, video, 3D, and audio.
- Comparing and analyzing generation techniques from the pre-LLM and post-LLM eras, giving a clear perspective on how these methods have progressed and been optimized.
- Summarizing, from a technical perspective, the various roles of LLMs in the generation or editing process for each modality.
- Discussing important AI safety issues, investigating emerging applications, and exploring future directions to promote the development of multimodal generation and world models.

The paper not only covers technical details but also emphasizes the important role of LLMs in promoting multimodal content generation, especially open-domain generation, which helps in designing better unified generation models that handle multimodal data.

LLMs Meet Multimodal Generation and Editing: A Survey

Retrieving Multimodal Information for Augmented Generation: A Survey

A Comprehensive Survey and Guide to Multimodal Large Language Models in Vision-Language Tasks

A Survey on Multimodal Large Language Models

Large Multimodal Agents: A Survey

Efficient Multimodal Large Language Models: A Survey

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

Surveying the MLLM Landscape: A Meta-Review of Current Surveys

MM-LLMs: Recent Advances in MultiModal Large Language Models

A Survey of Multimodal Large Language Model from A Data-centric Perspective

A Survey on Benchmarks of Multimodal Large Language Models

A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks

DreamLLM: Synergistic Multimodal Comprehension and Creation

A Survey on Evaluation of Multimodal Large Language Models

A Survey on Multimodal Benchmarks: In the Era of Large AI Models

LLMGA: Multimodal Large Language Model based Generation Assistant

From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models

Personalized Multimodal Large Language Models: A Survey

Multimodal Image Synthesis and Editing: The Generative AI Era

Multimodal Large Language Models: A Survey