Abstract: In an era defined by the explosive growth of data and rapid technological advancements, Multimodal Large Language Models (MLLMs) stand at the forefront of artificial intelligence (AI) systems. Designed to integrate diverse data types, including text, images, videos, audio, and physiological sequences, MLLMs address the complexities of real-world applications far beyond the capabilities of single-modality systems. In this paper, we systematically review the applications of MLLMs in multimodal tasks spanning natural language, vision, and audio. We also provide a comparative analysis of the focus of different MLLMs across these tasks, examine the shortcomings of current MLLMs, and suggest potential directions for future research. Through these discussions, we hope to provide valuable insights for the further development and application of MLLMs.

What problem does this paper attempt to address?

This paper aims to comprehensively review the applications and challenges of Multimodal Large Language Models (MLLMs) in various tasks. Specifically:

1. **Multimodal Data Integration**: With the rapid development of information technology and the surge in data volume, unimodal systems can no longer meet the demands of complex real-world tasks. MLLMs provide a more comprehensive and accurate representation of information by integrating various types of data (such as text, images, videos, audio, and physiological sequences).
2. **Applications in Different Fields**: The paper discusses in detail the applications of MLLMs in natural language processing (NLP), visual tasks, and audio tasks. For example, in NLP, MLLMs use images, videos, and audio to enhance text generation and machine translation; in visual tasks, MLLMs improve the performance of image classification, object detection, and other tasks; in audio tasks, MLLMs enhance speech recognition and emotion analysis.
3. **Technical Architecture and Components**: The paper introduces the basic concepts and main architectures of MLLMs, including multimodal input encoders, feature fusion mechanisms, and multimodal output decoders. These components work together to enable the model to process and integrate data from different modalities.
4. **Future Research Directions**: The paper also points out the current shortcomings of MLLMs and proposes future research directions. By evaluating existing implementations and technological advancements, it aims to provide references and guidance for the development of MLLMs.

Overall, this paper aims to provide valuable insights for the further development and application of MLLMs.
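The three components named in the summary (input encoders, a feature fusion mechanism, and an output decoder) can be pictured with a toy pipeline. The sketch below is purely illustrative and is not taken from the paper: the function names `encode_text`, `encode_image`, `fuse`, `decode`, and `run_pipeline` are hypothetical, the "encoders" are simple hand-written bucketing functions rather than learned networks, and fusion is plain concatenation, which is only one of many possible fusion strategies.

```python
from typing import Dict, List

Vector = List[float]


def encode_text(text: str, dim: int = 4) -> Vector:
    """Toy text encoder (assumption): bucket character codes into `dim` slots."""
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch) / 1000.0
    return vec


def encode_image(pixels: List[float], dim: int = 4) -> Vector:
    """Toy image encoder (assumption): average pixel intensities per bucket."""
    sums = [0.0] * dim
    counts = [0] * dim
    for i, p in enumerate(pixels):
        sums[i % dim] += p
        counts[i % dim] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def fuse(features: Dict[str, Vector]) -> Vector:
    """Fusion by concatenation, in sorted modality-name order for determinism."""
    fused: Vector = []
    for name in sorted(features):
        fused.extend(features[name])
    return fused


def decode(fused: Vector) -> str:
    """Toy decoder (assumption): summarize the fused vector as a string."""
    mean = sum(fused) / len(fused)
    return f"fused_dim={len(fused)} mean={mean:.3f}"


def run_pipeline(text: str, pixels: List[float]) -> str:
    """Encode each modality separately, fuse the features, then decode."""
    features = {"text": encode_text(text), "image": encode_image(pixels)}
    return decode(fuse(features))


if __name__ == "__main__":
    print(run_pipeline("a cat on a mat", [0.1, 0.5, 0.9, 0.3, 0.7, 0.2, 0.4, 0.8]))
```

In a real MLLM each stage would be a trained model (for example, a vision encoder feeding a projection layer into a language model), but the data flow follows the same encode, fuse, decode order.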

A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks
