Evaluating and Advancing Multimodal Large Language Models in Ability Lens

Feng Chen, Chenhui Gou, Jing Liu, Yang Yang, Zhaoyang Li, Jiyuan Zhang, Zhenbang Sun, Bohan Zhuang, Qi Wu
2024-11-22
Abstract: As multimodal large language models (MLLMs) advance rapidly, rigorous evaluation has become essential, providing further guidance for their development. In this work, we focus on a unified and robust evaluation of **vision perception** abilities, the foundational skill of MLLMs. We find that existing perception benchmarks, each focusing on different question types, domains, and evaluation metrics, introduce significant evaluation variance, complicating comprehensive assessments of perception abilities when relying on any single benchmark. To address this, we introduce **AbilityLens**, a unified benchmark designed to evaluate MLLMs across six key perception abilities, focusing on both accuracy and stability, with each ability encompassing diverse question types, domains, and metrics. With the assistance of AbilityLens, we: (1) identify the strengths and weaknesses of current models, highlighting stability patterns and revealing a notable performance gap between open-source and closed-source models; (2) introduce an online evaluation mode, which uncovers interesting ability conflict and early convergence phenomena during MLLM training; and (3) design a simple ability-specific model merging method that combines the best ability checkpoint from early training stages, effectively mitigating performance decline due to ability conflict. The benchmark and online leaderboard will be released soon.
Computer Vision and Pattern Recognition, Computation and Language, Machine Learning
### What problems does this paper attempt to solve?

This paper addresses three key problems in evaluating the visual perception abilities of Multimodal Large Language Models (MLLMs):

1. **Inconsistency of evaluation benchmarks**:
   - Existing perception benchmarks (such as MME, MMBench, and MuirBench) each focus on different question types, domains, and evaluation metrics, so their results differ significantly. Relying on any single benchmark therefore cannot give a comprehensive assessment.
   - The best and worst models vary from benchmark to benchmark, so no single benchmark provides a unified, comprehensive view.
2. **Lack of emphasis on stability**:
   - Current evaluation methods often over-emphasize accuracy while ignoring how stable a model is across factors such as domain, question type, and evaluation metric. For example, a model that performs excellently on some tasks may be unstable on others.
   - The paper argues that stability is an important dimension of model performance, especially in multimodal tasks, where a model should maintain consistent performance under varied conditions.
3. **Ability conflicts during training**:
   - During MLLM training, different perception abilities develop along different curves, and some abilities may even decline with further training. The paper calls this "ability conflict".
   - This phenomenon exposes a limitation of existing training methods and calls for finer-grained evaluation tools to monitor and optimize each ability.

To solve these problems, the authors introduce a new comprehensive evaluation benchmark, **AbilityLens**.
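
The difference between accuracy and stability raised above can be made concrete with a small sketch. It is only an illustration: the function name `summarize_ability`, the sub-benchmark names, and the scores are hypothetical, and the plain mean and population standard deviation are stand-ins chosen for simplicity, not the metric definitions used by AbilityLens.

```python
from statistics import mean, pstdev


def summarize_ability(scores):
    """Return (accuracy, instability) for one perception ability.

    `scores` maps a sub-benchmark name to a score in [0, 1].
    Accuracy is the plain mean of the scores; instability is their
    population standard deviation (lower means more stable).
    Both are illustrative assumptions, not the paper's formulas.
    """
    values = list(scores.values())
    if not values:
        raise ValueError("need at least one sub-benchmark score")
    return mean(values), pstdev(values)


# Hypothetical scores for two models on the same ability.
model_a = {"bench_1": 0.80, "bench_2": 0.80, "bench_3": 0.80}
model_b = {"bench_1": 1.00, "bench_2": 0.90, "bench_3": 0.50}

acc_a, instability_a = summarize_ability(model_a)
acc_b, instability_b = summarize_ability(model_b)

# Both models have the same mean score, but model_b swings far more
# between sub-benchmarks, which an accuracy-only ranking would hide.
print(f"A: accuracy={acc_a:.3f}, instability={instability_a:.3f}")
print(f"B: accuracy={acc_b:.3f}, instability={instability_b:.3f}")
```

Reporting the spread alongside the mean is what lets two models with equal average scores be told apart.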

AbilityLens improves on existing evaluation methods in the following ways:

- **Unified evaluation framework**: It covers six core perception abilities (counting, OCR, attribute recognition, entity extraction, localization, and structured data understanding), making evaluation comprehensive and consistent.
- **Online and offline evaluation modes**: It supports offline evaluation for comparing the overall performance of models, and also provides an online evaluation mode for monitoring how each ability changes during training.
- **Ability-specific model merging method**: It proposes a simple strategy that merges in the best checkpoint for a given ability from the early training stages, alleviating ability conflicts and improving overall performance.

Through these improvements, AbilityLens provides a more systematic and comprehensive evaluation tool for the development and optimization of MLLMs.
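
The idea of merging in the best early checkpoint for one ability can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the summary does not specify the merge rule, so linear interpolation of weights is assumed here, and the names `best_checkpoint`, `interpolate`, the `alpha` coefficient, and all numbers are hypothetical. Checkpoints are represented as plain dictionaries of float lists to keep the example self-contained.

```python
def best_checkpoint(history, ability):
    """Return the training step whose checkpoint scored highest on `ability`.

    `history` maps a training step to {ability name: score}, as an
    online evaluation run might record it (hypothetical layout).
    """
    return max(history, key=lambda step: history[step][ability])


def interpolate(weights_a, weights_b, alpha):
    """Return alpha * weights_a + (1 - alpha) * weights_b per parameter.

    Linear interpolation is an assumed merge rule for illustration.
    """
    if weights_a.keys() != weights_b.keys():
        raise ValueError("checkpoints must share parameter names")
    return {
        name: [alpha * a + (1 - alpha) * b
               for a, b in zip(weights_a[name], weights_b[name])]
        for name in weights_a
    }


# Hypothetical online-evaluation log: counting peaks early, then declines.
history = {
    1000: {"counting": 0.62, "ocr": 0.40},
    2000: {"counting": 0.55, "ocr": 0.52},
}
# Hypothetical checkpoints with a single two-element parameter.
checkpoints = {
    1000: {"layer.w": [1.0, 2.0]},
    2000: {"layer.w": [3.0, 4.0]},
}

final_step = max(history)                       # last checkpoint
peak_step = best_checkpoint(history, "counting")  # best for the declining ability
merged = interpolate(checkpoints[peak_step], checkpoints[final_step], alpha=0.5)
print(peak_step, merged)
```

In practice the interpolation weight would need tuning, since pulling too far toward the early checkpoint could cost the abilities that improved later in training.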