Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has been accompanied by the development of various benchmarks to evaluate their capabilities. However, the true nature of these evaluations, and the extent to which they assess multimodal reasoning versus merely leveraging the underlying Large Language Model (LLM) backbone, remain unclear. This paper presents a comprehensive investigation into the role of LLM backbones in MLLM evaluation, focusing on two critical aspects: the degree to which current benchmarks truly assess multimodal reasoning and the influence of LLM prior knowledge on performance. Specifically, we introduce a modified evaluation protocol to disentangle the contributions of the LLM backbone from multimodal integration, and an automatic knowledge identification technique for diagnosing whether LLMs possess the knowledge needed to answer the corresponding multimodal questions. Our study encompasses four diverse MLLM benchmarks and eight state-of-the-art MLLMs. Key findings reveal that some benchmarks allow high performance even without visual inputs and that up to 50% of error rates can be attributed to insufficient world knowledge in the LLM backbone, indicating a heavy reliance on language capabilities. To address knowledge deficiencies, we propose a knowledge augmentation pipeline that achieves significant performance gains, with improvements of up to 60% on certain datasets, resulting in an approximately 4x increase in performance. Our work provides crucial insights into the role of the LLM backbone in MLLMs, and highlights the need for more nuanced benchmarking approaches.

What problem does this paper attempt to address?

The paper addresses two key questions in the evaluation of Multimodal Large Language Models (MLLMs):

1. **Whether current benchmarks truly evaluate multimodal reasoning**: The paper examines how far existing multimodal benchmarks reflect a model's multimodal reasoning ability rather than the capabilities of the underlying Large Language Model (LLM). It finds that on some benchmarks models achieve high scores even without visual input, which indicates that these benchmarks lean heavily on the language model component and do not effectively evaluate multimodal integration.
2. **The impact of LLM prior knowledge on performance**: The paper analyzes how the prior knowledge of the LLM affects MLLM performance. It finds that up to 50% of the error rate can be attributed to insufficient world knowledge in the LLM. In addition, MLLMs built on knowledge-rich LLMs (such as LLaVA-Next-Yi-34B and InternVL2-Llama3-76B) perform better in the evaluation, which highlights the influence of LLM prior knowledge on overall performance.

To investigate these questions, the authors propose the following methods:

- **Modified evaluation protocol**: Remove the visual input, randomize the order of the answer options, and convert multiple-choice questions into open-ended generation tasks, in order to separate the role of language ability from that of multimodal reasoning in these benchmarks.
- **Automatic knowledge identification technique**: Use external knowledge sources (such as Wikipedia or powerful LLMs) to obtain the knowledge required by each question, then check whether the underlying LLM has the background knowledge needed to answer the corresponding multimodal question.

Through these methods, the paper reveals limitations of current multimodal evaluation benchmarks and proposes a knowledge augmentation pipeline that significantly improves performance on certain datasets, with an absolute accuracy improvement of up to 60%. These findings emphasize the need for more nuanced benchmarking in future MLLM development and evaluation, so that language model capabilities can be distinguished from true multimodal reasoning.
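The three perturbations in the modified evaluation protocol (no visual input, shuffled options, open-ended conversion) can be illustrated with a minimal sketch. This is not the authors' implementation: the `MCQuestion` container, its field names, and the helper functions are all hypothetical and exist only to make the idea concrete.

```python
import random
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class MCQuestion:
    """Hypothetical container for one multiple-choice benchmark item."""
    question: str
    options: tuple[str, ...]
    answer_index: int          # index of the correct option in `options`
    image_path: Optional[str]  # None means no visual input is provided


def drop_image(item: MCQuestion) -> MCQuestion:
    """Text-only variant: same question and options, no visual input."""
    return replace(item, image_path=None)


def shuffle_options(item: MCQuestion, rng: random.Random) -> MCQuestion:
    """Permute the options and keep the answer index on the same option text."""
    order = list(range(len(item.options)))
    rng.shuffle(order)
    new_options = tuple(item.options[i] for i in order)
    # The old answer index now sits at whatever position it was moved to.
    new_answer = order.index(item.answer_index)
    return replace(item, options=new_options, answer_index=new_answer)


def to_open_ended(item: MCQuestion) -> tuple[str, str]:
    """Open-ended variant: (prompt without options, reference answer text)."""
    return item.question, item.options[item.answer_index]


def format_prompt(item: MCQuestion) -> str:
    """Render a multiple-choice prompt with lettered options."""
    lines = [item.question]
    lines += [f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(item.options)]
    return "\n".join(lines)
```

Under these assumptions, one would score a model on the original items and on each variant. A small gap between the original and `drop_image` accuracy would suggest that the benchmark can largely be solved from text alone, while a drop under `shuffle_options` would point to sensitivity to option position rather than content.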

Understanding the Role of LLMs in Multimodal Evaluation Benchmarks
