Abstract: As multimodal large language models (MLLMs) advance rapidly, rigorous evaluation has become essential, providing further guidance for their development. In this work, we focus on a unified and robust evaluation of **vision perception** abilities, the foundational skill of MLLMs. We find that existing perception benchmarks, each focusing on different question types, domains, and evaluation metrics, introduce significant evaluation variance, complicating comprehensive assessments of perception abilities when relying on any single benchmark. To address this, we introduce **AbilityLens**, a unified benchmark designed to evaluate MLLMs across six key perception abilities, focusing on both accuracy and stability, with each ability encompassing diverse question types, domains, and metrics. With the assistance of AbilityLens, we: (1) identify the strengths and weaknesses of current models, highlighting stability patterns and revealing a notable performance gap between open-source and closed-source models; (2) introduce an online evaluation mode, which uncovers interesting ability conflict and early convergence phenomena during MLLM training; and (3) design a simple ability-specific model merging method that combines the best ability checkpoint from early training stages, effectively mitigating performance decline due to ability conflict. The benchmark and online leaderboard will be released soon.

What problem does this paper attempt to address?

This paper addresses three problems in evaluating the visual perception abilities of multimodal large language models (MLLMs):

1. **Inconsistent evaluation benchmarks**
   - Existing perception benchmarks (such as MME, MMBench, and MuirBench) each focus on different question types, domains, and evaluation metrics, so their results differ significantly. Relying on any single benchmark therefore complicates a comprehensive assessment.
   - The best and worst models vary from one benchmark to another, so no individual benchmark offers a unified, comprehensive view.
2. **Lack of emphasis on stability**
   - Current evaluation methods often overemphasize accuracy while ignoring how stable a model is across different factors (such as domain, question type, and evaluation metric). For example, a model that performs excellently on some tasks may be unstable on others.
   - The paper argues that stability is an important dimension of model performance, especially in multimodal tasks, where a model should maintain consistent performance under varied conditions.
3. **Ability conflicts during training**
   - During MLLM training, different perception abilities follow different development curves, and some abilities may even decline with further training. The paper calls this "ability conflict".
   - This phenomenon exposes a limitation of existing training methods and calls for finer-grained evaluation tools to monitor and optimize individual abilities.

To address these problems, the authors introduce a new unified evaluation benchmark, **AbilityLens**, which improves on existing evaluation methods in three ways:

- **Unified evaluation framework**: It covers six core perception abilities (counting, OCR, attribute recognition, entity extraction, localization, and structured data understanding), making the evaluation comprehensive and consistent.
- **Online and offline evaluation modes**: It supports offline evaluation for comparing the overall performance of models, and an online evaluation mode for monitoring how abilities change during training.
- **Ability-specific model merging**: It proposes a simple strategy that strengthens a specific ability by merging in the best checkpoint for that ability from an early training stage, which alleviates ability conflict and improves overall performance.

Through these improvements, AbilityLens provides a more systematic and comprehensive evaluation tool for developing and optimizing MLLMs.
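The ideas summarized above (per-ability accuracy and stability, ability conflict during training, and checkpoint merging) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the paper's actual method: the summary does not give the stability formula, so stability is taken here to be the population standard deviation of normalized sub-benchmark scores; "conflict" is taken to be a final score that falls below an earlier peak by more than a tolerance; and merging is shown as plain linear interpolation of parameters. The function names `ability_summary`, `find_ability_conflicts`, and `interpolate_weights` are hypothetical.

```python
from statistics import mean, pstdev


def ability_summary(scores):
    """Return (accuracy, stability) for one perception ability.

    scores: normalized scores in [0, 1], one per sub-benchmark.
    Assumption: accuracy is the mean score and stability is the
    population standard deviation (lower means more stable).
    """
    return mean(scores), pstdev(scores)


def find_ability_conflicts(history, tolerance=0.01):
    """Flag abilities whose final accuracy is below an earlier peak.

    history: {ability: [accuracy at checkpoint 0, 1, ...]}
    Returns {ability: (best_checkpoint_index, drop_from_peak)}.
    Assumption: a drop larger than `tolerance` counts as a conflict.
    """
    conflicts = {}
    for ability, curve in history.items():
        # Index of the checkpoint with the highest accuracy (first on ties).
        best_idx = max(range(len(curve)), key=curve.__getitem__)
        drop = curve[best_idx] - curve[-1]
        if drop > tolerance:
            conflicts[ability] = (best_idx, drop)
    return conflicts


def interpolate_weights(final_weights, best_weights, alpha=0.5):
    """Linearly interpolate two checkpoints, parameter by parameter.

    Both arguments map parameter names to floats here; a real model
    would hold tensors. alpha=0 keeps the final checkpoint unchanged,
    alpha=1 returns the best-ability checkpoint.
    """
    return {
        name: (1 - alpha) * final_weights[name] + alpha * best_weights[name]
        for name in final_weights
    }


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    accuracy, stability = ability_summary([0.5, 0.7, 0.9])
    print(f"accuracy={accuracy:.3f} stability={stability:.3f}")

    history = {"counting": [0.40, 0.55, 0.50], "ocr": [0.30, 0.45, 0.60]}
    print(find_ability_conflicts(history))

    print(interpolate_weights({"w": 1.0}, {"w": 3.0}, alpha=0.5))
```

In this toy run, counting peaks at the second checkpoint and then declines, so it is flagged, while OCR keeps improving and is not. Under these assumptions, an online evaluation loop would call `find_ability_conflicts` after each checkpoint and merge the flagged ability's best checkpoint back into the latest one.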

Evaluating and Advancing Multimodal Large Language Models in Ability Lens
