How to Bridge the Gap between Modalities: A Comprehensive Survey on Multimodal Large Language Model

Shezheng Song, Xiaopeng Li, Shasha Li, Shan Zhao, Jie Yu, Jun Ma, Xiaoguang Mao, Weimin Zhang

DOI: https://doi.org/10.48550/arXiv.2311.07594

2023-12-19

Abstract: This review paper explores Multimodal Large Language Models (MLLMs), which integrate Large Language Models (LLMs) like GPT-4 to handle multimodal data such as text and vision. MLLMs demonstrate capabilities like generating image narratives and answering image-based questions, bridging the gap towards real-world human-computer interactions and hinting at a potential pathway to artificial general intelligence. However, MLLMs still face challenges in processing the semantic gap in multimodality, which may lead to erroneous generation, posing potential risks to society. Choosing the appropriate modality alignment method is crucial, as improper methods might require more parameters with limited performance improvement. This paper aims to explore modality alignment methods for LLMs and their existing capabilities. Implementing modality alignment allows LLMs to address environmental issues and enhance accessibility. The study surveys existing modal alignment methods in MLLMs into four groups: (1) Multimodal Converters that change data into something LLMs can understand; (2) Multimodal Perceivers to improve how LLMs perceive different types of data; (3) Tools Assistance for changing data into one common format, usually text; and (4) Data-Driven methods that teach LLMs to understand specific types of data in a dataset. This field is still in a phase of exploration and experimentation, and we will organize and update various existing research methods for multimodal information alignment.

Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition, Multimedia

What problem does this paper attempt to address?

This paper addresses how to bridge the gap between modalities in multimodal large language models (MLLMs). Specifically, it explores how to extend text-only large language models (LLMs) so that they can process multimodal data such as text and images, which broadens their usefulness in real-world applications. For example, generating image narratives and answering image-based questions can improve human-computer interaction and may point to a path towards artificial general intelligence. However, MLLMs still struggle with the semantic gap between modalities, which can lead to erroneous generation and pose potential risks to society. Choosing an appropriate modality alignment method is therefore crucial: an improper method may require more parameters while yielding only limited performance improvement, which raises computational and usage costs. The paper surveys modality alignment methods applicable to LLMs, along with their existing capabilities, and groups them into four categories: Multimodal Converters, Multimodal Perceivers, Tools Assistance, and Data-Driven methods.
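To make the first category more concrete, the sketch below illustrates one possible reading of a Multimodal Converter: a projection that maps image features into the same embedding space as the LLM's text tokens, so both can be concatenated into one input sequence. This is an illustration under stated assumptions, not a method taken from the survey. The function names (`make_projection`, `project`, `build_llm_input`), the toy dimensions, and the random weights are all hypothetical; in a real system the projection would be learned and the features would come from a vision encoder.

```python
import random

def make_projection(in_dim, out_dim, seed=0):
    """Build a toy in_dim x out_dim weight matrix (random here; learned in practice)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]

def project(features, weight):
    """Map each feature vector of length in_dim to length out_dim (features @ weight)."""
    out_dim = len(weight[0])
    return [
        [sum(vec[i] * weight[i][j] for i in range(len(vec))) for j in range(out_dim)]
        for vec in features
    ]

def build_llm_input(image_features, text_embeddings, weight):
    """Prepend projected visual tokens to the text token embeddings."""
    visual_tokens = project(image_features, weight)
    return visual_tokens + text_embeddings

# Hypothetical sizes: a 4-dim vision encoder output and a 6-dim LLM embedding space.
VISION_DIM, LLM_DIM = 4, 6
weight = make_projection(VISION_DIM, LLM_DIM)

image_features = [[1.0, 0.0, 0.5, -0.5], [0.2, 0.3, 0.1, 0.0]]  # 2 image patches
text_embeddings = [[0.0] * LLM_DIM for _ in range(3)]           # 3 text tokens

sequence = build_llm_input(image_features, text_embeddings, weight)
print(len(sequence), len(sequence[0]))  # 5 tokens, each of length 6
```

The point of the sketch is only the shape change: after projection, visual tokens have the same width as text embeddings, which is what lets an unmodified LLM consume them as part of its input sequence.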

A Survey on Multimodal Large Language Models

A Comprehensive Survey of Multimodal Large Language Models: Concept, Application and Safety

A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks

A Survey of Multimodal Large Language Model from A Data-centric Perspective

A Survey on Evaluation of Multimodal Large Language Models

Multimodal Large Language Models: A Survey

Surveying the MLLM Landscape: A Meta-Review of Current Surveys

A Comprehensive Survey and Guide to Multimodal Large Language Models in Vision-Language Tasks

Large Multimodal Agents: A Survey

LLMs Meet Multimodal Generation and Editing: A Survey

Cross-Modal Consistency in Multimodal Large Language Models

A Review of Multi-Modal Large Language and Vision Models

From Specific-MLLM to Omni-MLLM: A Survey about the MLLMs alligned with Multi-Modality

OneLLM: One Framework to Align All Modalities with Language

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

ModaVerse: Efficiently Transforming Modalities with LLMs

Efficient Multimodal Large Language Models: A Survey

A Survey on Benchmarks of Multimodal Large Language Models

AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability

Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey