A Survey on Multimodal Large Language Models

Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, Enhong Chen

2024-04-02

Abstract: Recently, Multimodal Large Language Models (MLLMs), represented by GPT-4V, have become a rising research hotspot. They use powerful Large Language Models (LLMs) as a brain to perform multimodal tasks. The surprising emergent capabilities of MLLMs, such as writing stories based on images and OCR-free math reasoning, are rare in traditional multimodal methods, suggesting a potential path to artificial general intelligence. To this end, both academia and industry have endeavored to develop MLLMs that can compete with or even outperform GPT-4V, pushing the limit of research at a surprising speed. In this paper, we aim to trace and summarize the recent progress of MLLMs. First, we present the basic formulation of the MLLM and delineate its related concepts, including architecture, training strategy and data, as well as evaluation. Then, we introduce research topics about how MLLMs can be extended to support more granularity, modalities, languages, and scenarios. We continue with multimodal hallucination and extended techniques, including Multimodal ICL (M-ICL), Multimodal CoT (M-CoT), and LLM-Aided Visual Reasoning (LAVR). To conclude the paper, we discuss existing challenges and point out promising research directions. In light of the fact that the era of MLLM has only just begun, we will keep updating this survey and hope it can inspire more research. An associated GitHub link collecting the latest papers is available at

Artificial Intelligence, Computation and Language, Machine Learning

What problem does this paper attempt to address?

This paper surveys research on multimodal large language models (MLLMs), which use a large language model (LLM) as the core for handling multimodal tasks. Compared with traditional multimodal methods, MLLMs show new capabilities such as writing stories based on images and performing OCR-free mathematical reasoning, suggesting a possible path towards artificial general intelligence. The authors introduce the basic formulation of MLLMs, covering architecture, training strategy and data, and evaluation, and discuss how MLLMs can be extended to support more granularity, modalities, languages, and scenarios. They also cover multimodal hallucination and extended techniques, including multimodal in-context learning (M-ICL), multimodal chain of thought (M-CoT), and LLM-aided visual reasoning (LAVR). Finally, the paper discusses existing challenges and promising research directions, noting that the era of MLLMs has only just begun and that the survey will continue to be updated.

A Survey on Evaluation of Multimodal Large Language Models

A Survey of Multimodal Large Language Model from A Data-centric Perspective

A Comprehensive Survey and Guide to Multimodal Large Language Models in Vision-Language Tasks

A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks

Efficient Multimodal Large Language Models: A Survey

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

Surveying the MLLM Landscape: A Meta-Review of Current Surveys

A Review of Multi-Modal Large Language and Vision Models

Survey of different Large Language Model Architectures: Trends, Benchmarks, and Challenges

Multimodal Large Language Models: A Survey

A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in Medicine

LLMs Meet Multimodal Generation and Editing: A Survey

Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning

A Survey on Benchmarks of Multimodal Large Language Models

Large Multimodal Agents: A Survey

Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey

Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers

A Survey of Large Language Models

From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models