MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks

Xiaocui Yang, Wenfang Wu, Shi Feng, Ming Wang, Daling Wang, Yang Li, Qi Sun, Yifei Zhang, Xiaoming Fu, Soujanya Poria
2023-10-13
Abstract: The popularity of multimodal large language models (MLLMs) has triggered a recent surge in research efforts dedicated to evaluating these models. Nevertheless, existing evaluation studies of MLLMs primarily focus on the comprehension and reasoning of unimodal (vision) content, neglecting performance evaluations in the domain of multimodal (vision-language) content understanding. Beyond multimodal reasoning, tasks related to multimodal content comprehension necessitate a profound understanding of multimodal contexts, achieved through the multimodal interaction to obtain a final answer. In this paper, we introduce a comprehensive assessment framework called MM-BigBench, which incorporates a diverse range of metrics to offer an extensive evaluation of the performance of various models and instructions across a wide spectrum of diverse multimodal content comprehension tasks. Consequently, our work complements research on the performance of MLLMs in multimodal comprehension tasks, achieving a more comprehensive and holistic evaluation of MLLMs. To begin, we employ the Best Performance metric to ascertain each model's performance upper bound on different datasets. Subsequently, the Mean Relative Gain metric offers an assessment of the overall performance of various models and instructions, while the Stability metric measures their sensitivity. Furthermore, previous research centers on evaluating models independently or solely assessing instructions, neglecting the adaptability between models and instructions. We propose the Adaptability metric to quantify the adaptability between models and instructions. Our paper evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions for each task, and derives novel insights. Our code will be released at https://github.com/declare-lab/MM-BigBench.
Computation and Language, Multimedia
What problem does this paper attempt to address?
The paper addresses the evaluation of multimodal large language models (MLLMs) on multimodal content comprehension tasks. Most existing evaluations focus on understanding and reasoning over unimodal (vision) content and neglect performance on multimodal (vision-language) content. The authors propose MM-BigBench, a comprehensive evaluation framework built on four metrics (Best Performance, Mean Relative Gain, Stability, and Adaptability) that assess both models and instructions. The main contributions of the paper are:
1. An evaluation of 20 language models, 14 of them MLLMs, on 14 multimodal datasets spanning 6 multimodal content comprehension tasks, with 10 instructions per task.
2. The MM-BigBench evaluation framework, whose diverse metrics assess the performance of different models and instructions.
3. Extensive experiments that establish benchmarks for LLMs and MLLMs on multimodal content comprehension tasks and yield novel insights.
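The abstract names the four metrics but does not give their formulas, so the following is only a minimal sketch under stated assumptions. It takes Best Performance as the maximum accuracy over instructions, Mean Relative Gain as the average percentage gain over the cross-model mean, Stability as the standard deviation over instructions, and Adaptability as a top-k hit rate across datasets. All function names, model names, instruction names, and numbers are hypothetical and may differ from the paper's actual definitions.

```python
from statistics import mean, pstdev

# Hypothetical accuracies (%) on one dataset: ACC[model][instruction].
ACC = {
    "model_a": {"inst_1": 60.0, "inst_2": 70.0},
    "model_b": {"inst_1": 40.0, "inst_2": 50.0},
}

# Hypothetical accuracies for a single model: ONE_MODEL[dataset][instruction].
ONE_MODEL = {
    "data_1": {"inst_1": 60.0, "inst_2": 70.0},
    "data_2": {"inst_1": 55.0, "inst_2": 45.0},
}


def best_performance(acc):
    """Assumed definition: highest accuracy a model reaches over all instructions."""
    return {m: max(scores.values()) for m, scores in acc.items()}


def mean_relative_gain(acc):
    """Assumed definition: per model, the average over instructions of the
    percentage gain relative to the mean accuracy of all models."""
    instructions = list(next(iter(acc.values())))
    baseline = {i: mean([acc[m][i] for m in acc]) for i in instructions}
    return {
        m: mean([(acc[m][i] - baseline[i]) / baseline[i] * 100 for i in instructions])
        for m in acc
    }


def stability(acc):
    """Assumed definition: standard deviation of a model's accuracy across
    instructions (lower means less sensitive to the instruction)."""
    return {m: pstdev(list(scores.values())) for m, scores in acc.items()}


def adaptability(acc_by_dataset, k=1):
    """Assumed definition: fraction of datasets on which an instruction ranks
    among the top-k instructions for one model."""
    instructions = list(next(iter(acc_by_dataset.values())))
    hits = {i: 0 for i in instructions}
    for scores in acc_by_dataset.values():
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        for i in top_k:
            hits[i] += 1
    return {i: hits[i] / len(acc_by_dataset) for i in instructions}


if __name__ == "__main__":
    print(best_performance(ACC))
    print(mean_relative_gain(ACC))
    print(stability(ACC))
    print(adaptability(ONE_MODEL, k=1))
```

Swapping the roles of models and instructions in `mean_relative_gain` and `stability` would give the corresponding instruction-level scores, which matches the abstract's statement that both models and instructions are assessed.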