A Survey on Evaluation of Multimodal Large Language Models

Jiaxing Huang, Jingyi Zhang

2024-08-28

Abstract: Multimodal Large Language Models (MLLMs) mimic the human perception and reasoning system by integrating powerful Large Language Models (LLMs) with various modality encoders (e.g., vision, audio), positioning LLMs as the "brain" and the modality encoders as sensory organs. This framework endows MLLMs with human-like capabilities and suggests a potential pathway towards achieving artificial general intelligence (AGI). With the emergence of all-round MLLMs like GPT-4V and Gemini, a multitude of evaluation methods have been developed to assess their capabilities across different dimensions. This paper presents a systematic and comprehensive review of MLLM evaluation methods, covering the following key aspects: (1) the background of MLLMs and their evaluation; (2) "what to evaluate", which reviews and categorizes existing MLLM evaluation tasks based on the capabilities assessed, including general multimodal recognition, perception, reasoning and trustworthiness, and domain-specific applications such as socioeconomic, natural sciences and engineering, medical usage, AI agent, remote sensing, video and audio processing, 3D point cloud analysis, and others; (3) "where to evaluate", which summarizes MLLM evaluation benchmarks into general and specific benchmarks; and (4) "how to evaluate", which reviews and illustrates MLLM evaluation steps and metrics. Our overarching goal is to provide valuable insights for researchers in the field of MLLM evaluation, thereby facilitating the development of more capable and reliable MLLMs. We emphasize that evaluation should be regarded as a critical discipline, essential for advancing the field of MLLMs.

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

The paper aims to address the lack of a systematic review of evaluation methods for Multimodal Large Language Models (MLLMs). Specifically, its goals are:

1. **Evaluation Framework**: Provide a comprehensive and systematic review of MLLM evaluation methods, covering general multimodal recognition, perception, and reasoning abilities, as well as application-specific evaluations.
2. **Evaluation Task Classification**: Classify and summarize existing MLLM evaluation tasks, including general multimodal recognition, perception, reasoning, and trustworthiness, along with domain-specific applications such as socioeconomics, natural sciences and engineering, and medicine.
3. **Evaluation Benchmarks**: Summarize existing MLLM evaluation benchmarks, dividing them into general benchmarks and specific benchmarks.
4. **Evaluation Steps and Metrics**: Review and illustrate the evaluation steps and metrics for MLLMs, offering insights that help researchers develop more capable and reliable MLLMs.

Through these efforts, the paper hopes to fill the current gap in systematic reviews of MLLM evaluation methods and to support further research and development in this field.

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

Surveying the MLLM Landscape: A Meta-Review of Current Surveys

A Survey on Benchmarks of Multimodal Large Language Models

A Survey on Multimodal Large Language Models

A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks

A Comprehensive Survey of Multimodal Large Language Models: Concept, Application and Safety

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

A Survey on Evaluation of Large Language Models

A Survey of Multimodal Large Language Model from A Data-centric Perspective

A Review of Multi-Modal Large Language and Vision Models

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning

Efficient Multimodal Large Language Models: A Survey

MM-InstructEval: Zero-Shot Evaluation of (Multimodal) Large Language Models on Multimodal Reasoning Tasks

Large Multimodal Agents: A Survey

How to Bridge the Gap between Modalities: A Comprehensive Survey on Multimodal Large Language Model

Evaluating Large Language Models: A Comprehensive Survey

A Comprehensive Survey and Guide to Multimodal Large Language Models in Vision-Language Tasks