LLMs Meet Multimodal Generation and Editing: A Survey

Yingqing He, Zhaoyang Liu, Jingye Chen, Zeyue Tian, Hongyu Liu, Xiaowei Chi, Runtao Liu, Ruibin Yuan, Yazhou Xing, Wenhai Wang, Jifeng Dai, Yong Zhang, Wei Xue, Qifeng Liu, Yike Guo, Qifeng Chen
2024-06-09
Abstract: With the recent advancement in large language models (LLMs), there is a growing interest in combining LLMs with multimodal learning. Previous surveys of multimodal large language models (MLLMs) mainly focus on multimodal understanding. This survey elaborates on multimodal generation and editing across various domains, comprising image, video, 3D, and audio. Specifically, we summarize the notable advancements with milestone works in these fields and categorize these studies into LLM-based and CLIP/T5-based methods. Then, we summarize the various roles of LLMs in multimodal generation and exhaustively investigate the critical technical components behind these methods and the multimodal datasets utilized in these studies. Additionally, we dig into tool-augmented multimodal agents that can leverage existing generative models for human-computer interaction. Lastly, we discuss the advancements in the generative AI safety field, investigate emerging applications, and discuss future prospects. Our work provides a systematic and insightful overview of multimodal generation and processing, which is expected to advance the development of Artificial Intelligence for Generative Content (AIGC) and world models. A curated list of all related papers can be found at https://github.com/YingqingHe/Awesome-LLMs-meet-Multimodal-Generation
Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition, Multimedia, Sound
What problem does this paper attempt to address?
This paper addresses how large language models (LLMs) can be used to improve multimodal generation and editing. Specifically, it focuses on combining LLMs with non-text modalities such as images, video, 3D, and audio to achieve higher-quality content generation and editing. By reviewing existing work, the paper examines the roles LLMs play in generation across modalities: as evaluators, labelers, instruction processors, planners, providers of semantic guidance, or backbone architectures. It also discusses the key technical components behind these methods and the multimodal datasets they use. In addition, the paper covers generative AI safety, emerging applications, and future directions, aiming to provide a systematic, in-depth overview for the development of multimodal generation and world models.

The main contributions of the paper include:
- Providing the first systematic review of LLM applications in multimodal generation and editing, covering images, video, 3D, and audio.
- Comparing generation techniques from the pre-LLM and post-LLM eras to give a clear view of how these methods have progressed.
- Summarizing, from a technical perspective, the roles LLMs play in the generation or editing process for each modality.
- Discussing important AI safety issues, investigating emerging applications, and exploring future directions to advance multimodal generation and world models.

Beyond technical details, the paper emphasizes the role of LLMs in advancing multimodal content generation, especially open-domain generation, which can inform the design of better unified generative models for multimodal data.