MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks

Xiaocui Yang, Wenfang Wu, Shi Feng, Ming Wang, Daling Wang, Yang Li, Qi Sun, Yifei Zhang, Xiaoming Fu, Soujanya Poria

2023-10-13

Abstract: The popularity of multimodal large language models (MLLMs) has triggered a recent surge in research efforts dedicated to evaluating these models. Nevertheless, existing evaluation studies of MLLMs primarily focus on the comprehension and reasoning of unimodal (vision) content, neglecting performance evaluations in the domain of multimodal (vision-language) content understanding. Beyond multimodal reasoning, tasks related to multimodal content comprehension necessitate a profound understanding of multimodal contexts, achieved through the multimodal interaction to obtain a final answer. In this paper, we introduce a comprehensive assessment framework called MM-BigBench, which incorporates a diverse range of metrics to offer an extensive evaluation of the performance of various models and instructions across a wide spectrum of diverse multimodal content comprehension tasks. Consequently, our work complements research on the performance of MLLMs in multimodal comprehension tasks, achieving a more comprehensive and holistic evaluation of MLLMs. To begin, we employ the Best Performance metric to ascertain each model's performance upper bound on different datasets. Subsequently, the Mean Relative Gain metric offers an assessment of the overall performance of various models and instructions, while the Stability metric measures their sensitivity. Furthermore, previous research centers on evaluating models independently or solely assessing instructions, neglecting the adaptability between models and instructions. We propose the Adaptability metric to quantify the adaptability between models and instructions. Our paper evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions for each task, and derives novel insights. Our code will be released at https://github.com/declare-lab/MM-BigBench.

Computation and Language, Multimedia

What problem does this paper attempt to address?

The paper addresses how to evaluate multimodal large language models (MLLMs) on multimodal content comprehension tasks. Most existing evaluations focus on these models' understanding of and reasoning over unimodal (vision) content and neglect their performance on multimodal (vision-language) content. The authors propose an evaluation framework, MM-BigBench, that assesses both models and instructions using four metrics: Best Performance, Mean Relative Gain, Stability, and Adaptability. The main contributions of the paper are:

1. An evaluation of 20 language models, 14 of them MLLMs, on 14 multimodal datasets spanning 6 multimodal content comprehension tasks, with 10 instructions for each task.

2. The MM-BigBench evaluation framework, whose metrics measure each model's performance upper bound (Best Performance), the overall performance of models and instructions (Mean Relative Gain), their sensitivity (Stability), and how well particular models and instructions suit each other (Adaptability).

3. Extensive experiments that establish benchmarks for LLMs and MLLMs on multimodal content comprehension tasks and yield novel insights.
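The four metrics named above can be made concrete with a small sketch. The abstract gives no formulas, so every definition below is an assumption chosen to match the one-line descriptions: Best Performance as the maximum accuracy over instructions, Mean Relative Gain as the average percentage gain over the cross-model mean, Stability as the standard deviation across instructions, and Adaptability as the share of datasets on which an instruction ranks in a model's top k. The model names, dataset names, and accuracy values are hypothetical and are not results from the paper.

```python
from statistics import mean, pstdev

# Hypothetical accuracies (%): model -> one score per instruction.
# These numbers are invented for illustration only.
ACC = {"model_a": [60.0, 50.0, 70.0], "model_b": [40.0, 50.0, 30.0]}
ACC_2 = {"model_a": [80.0, 55.0, 65.0], "model_b": [45.0, 35.0, 50.0]}
DATASETS = {"dataset_1": ACC, "dataset_2": ACC_2}


def best_performance(acc):
    """Assumed definition: highest accuracy a model reaches over all instructions."""
    return {model: max(row) for model, row in acc.items()}


def mean_relative_gain(acc):
    """Assumed definition: a model's percentage gain over the cross-model
    mean for each instruction, averaged over instructions."""
    models = list(acc)
    n_instr = len(acc[models[0]])
    col_means = [mean(acc[m][i] for m in models) for i in range(n_instr)]
    return {
        m: mean((acc[m][i] - col_means[i]) / col_means[i] * 100
                for i in range(n_instr))
        for m in models
    }


def stability(acc):
    """Assumed definition: population standard deviation of a model's
    accuracy across instructions (lower means less sensitive)."""
    return {model: pstdev(row) for model, row in acc.items()}


def adaptability(per_dataset, k):
    """Assumed definition: fraction of datasets on which instruction i is
    among the k best-scoring instructions for model m."""
    hits = {}
    for acc in per_dataset.values():
        for model, row in acc.items():
            ranked = sorted(range(len(row)), key=row.__getitem__, reverse=True)
            top_k = set(ranked[:k])
            for i in range(len(row)):
                hits[(model, i)] = hits.get((model, i), 0) + (i in top_k)
    return {key: count / len(per_dataset) for key, count in hits.items()}


if __name__ == "__main__":
    print(best_performance(ACC))
    print(mean_relative_gain(ACC))
    print(stability(ACC))
    print(adaptability(DATASETS, k=1))
```

The first three functions score models on a single dataset; swapping the roles of rows and columns would score instructions in the same way. Adaptability is the only one that needs several datasets, because it asks how consistently a given model-instruction pair performs well.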

MM-InstructEval: Zero-Shot Evaluation of (Multimodal) Large Language Models on Multimodal Reasoning Tasks

A Survey on Benchmarks of Multimodal Large Language Models

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

MMBench: Is Your Multi-modal Model an All-around Player?

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

A Survey on Multimodal Benchmarks: In the Era of Large AI Models

MMR: Evaluating Reading Ability of Large Multimodal Models

CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark

Needle In A Multimodal Haystack

VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

Understanding the Role of LLMs in Multimodal Evaluation Benchmarks

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

MileBench: Benchmarking MLLMs in Long Context

MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark