MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities

Weihao Yu, Zhengyuan Yang, Linfeng Ren, Linjie Li, Jianfeng Wang, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Lijuan Wang, Xinchao Wang
2024-08-02
Abstract: MM-Vet, with open-ended vision-language questions designed to evaluate integrated capabilities, has become one of the most popular benchmarks for large multimodal model evaluation. MM-Vet assesses six core vision-language (VL) capabilities: recognition, knowledge, spatial awareness, language generation, OCR, and math. However, its question format is restricted to single image-text pairs and lacks the interleaved image and text sequences prevalent in real-world scenarios. To address this limitation, we introduce MM-Vet v2, which includes a new VL capability called "image-text sequence understanding" that evaluates models' ability to process VL sequences. Furthermore, we maintain the high quality of evaluation samples while further expanding the evaluation set size. Using MM-Vet v2 to benchmark large multimodal models, we found that Claude 3.5 Sonnet is the best model with a score of 71.8, slightly outperforming GPT-4o, which scored 71.0. Among open-weight models, InternVL2-Llama3-76B leads with a score of 68.4.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper addresses shortcomings in how large multimodal models (LMMs) are evaluated, particularly for the more advanced capabilities of recent LMMs. It proposes the MM-Vet v2 benchmark, which extends the original MM-Vet in two ways:

1. **Enhanced Capability Assessment**: MM-Vet v2 adds a seventh core capability, image-text sequence understanding, alongside the six in MM-Vet (recognition, knowledge, spatial awareness, language generation, OCR, and math). It evaluates a model's ability to process interleaved image and text sequences, which are prevalent in real-world scenarios but were not covered by the original benchmark's single image-text pairs.
2. **Expanded Evaluation Set Size**: The paper enlarges the evaluation set while maintaining the high quality of its samples, so that the evaluation results are more comprehensive and representative.

Using MM-Vet v2 to benchmark current LMMs, the authors find that Claude 3.5 Sonnet is the best-performing model with a score of 71.8, slightly ahead of GPT-4o at 71.0. Among open-weight models, InternVL2-Llama3-76B leads with a score of 68.4.