VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models

Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Amit Agarwal, Zhe Chen, Mo Li, Yubo Ma, Hailong Sun, Xiangyu Zhao, Junbo Cui, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, Dahua Lin, Kai Chen

2024-09-12

Abstract: We present VLMEvalKit: an open-source toolkit for evaluating large multi-modality models based on PyTorch. The toolkit aims to provide a user-friendly and comprehensive framework for researchers and developers to evaluate existing multi-modality models and publish reproducible evaluation results. In VLMEvalKit, we implement over 70 different large multi-modality models, including both proprietary APIs and open-source models, as well as more than 20 different multi-modal benchmarks. By implementing a single interface, new models can be easily added to the toolkit, while the toolkit automatically handles the remaining workloads, including data preparation, distributed inference, prediction post-processing, and metric calculation. Although the toolkit is currently mainly used for evaluating large vision-language models, its design is compatible with future updates that incorporate additional modalities, such as audio and video. Based on the evaluation results obtained with the toolkit, we host OpenVLM Leaderboard, a comprehensive leaderboard to track the progress of multi-modality learning research. The toolkit is released at https://github.com/open-compass/VLMEvalKit and is actively maintained.
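The "single interface" idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the names `MultiModalModel`, `generate`, `AlwaysAModel`, and `evaluate` are hypothetical and are not taken from the toolkit's actual API, and the exact-match metric stands in for whatever metrics a real benchmark defines. The point is only that a new model implements one method, while a shared loop handles inference, post-processing, and metric calculation.

```python
from abc import ABC, abstractmethod


class MultiModalModel(ABC):
    """Hypothetical single interface: a model only has to answer one query."""

    @abstractmethod
    def generate(self, image_path: str, prompt: str) -> str:
        """Return the model's text response for one image and one prompt."""


class AlwaysAModel(MultiModalModel):
    """Toy stand-in for a real model; it ignores its inputs and answers 'A'."""

    def generate(self, image_path: str, prompt: str) -> str:
        return "A"


def evaluate(model: MultiModalModel, samples: list) -> float:
    """Run inference on every sample and return exact-match accuracy."""
    if not samples:
        return 0.0
    correct = 0
    for sample in samples:
        prediction = model.generate(sample["image"], sample["question"])
        # Minimal post-processing: trim whitespace and ignore case.
        if prediction.strip().upper() == sample["answer"].strip().upper():
            correct += 1
    return correct / len(samples)


# Illustrative data only; the image paths are placeholders and are never opened.
samples = [
    {"image": "img_0.jpg", "question": "Which is shown? A. cat B. dog", "answer": "A"},
    {"image": "img_1.jpg", "question": "Which color? A. red B. blue", "answer": "B"},
]
accuracy = evaluate(AlwaysAModel(), samples)
print(accuracy)
```

Under this design, adding a model means writing one subclass; the evaluation loop and the data format stay untouched.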

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses two obstacles to evaluating Large Multi-Modality Models (LMMs):

1. **Evaluation Difficulty**: As LMMs develop, comprehensive and detailed evaluation becomes increasingly important. For small research teams, however, evaluating models across multiple benchmarks is a daunting task: it requires preparing data from different repositories and managing potential environment conflicts.
2. **Incomplete Results**: Benchmark authors may not report results for every LMM a user is interested in, so compiling a complete comparison from partial results takes significant effort.

To address these issues, the paper introduces VLMEvalKit, an open-source toolkit designed to simplify LMM evaluation. The toolkit supports over 70 large multi-modality models and more than 20 multi-modal benchmarks. Through a unified interface, new models can be integrated easily, and the toolkit automatically handles data preparation, distributed inference, prediction post-processing, and metric calculation. VLMEvalKit also employs generative evaluation, using large language models as choice extractors to reduce the impact of response style on evaluation results, which improves the reliability and reproducibility of evaluations. Based on these results, the authors maintain the OpenVLM Leaderboard, a comprehensive leaderboard that tracks the progress of multi-modality learning research.
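The choice-extraction step mentioned above can be sketched as follows. This is a hedged illustration, not the toolkit's implementation: the function names `match_choice` and `extract_choice` are hypothetical, the rule-based first pass is an assumption added here to keep the example cheap to run, and `llm_extractor` is a placeholder callable standing in for a real language-model call. The sketch shows why an extractor helps: free-form responses that never state a bare option label can still be mapped to a choice.

```python
import re
from typing import Callable, Dict, Optional


def match_choice(response: str, options: Dict[str, str]) -> Optional[str]:
    """Rule-based pass: return an option label only when it is unambiguous."""
    text = response.strip()
    # Case 1: the response is just a label such as "B", "(B)" or "B.".
    bare = re.fullmatch(r"\(?([A-Z])\)?[.:]?", text)
    if bare and bare.group(1) in options:
        return bare.group(1)
    # Case 2: exactly one option's text appears in the response.
    hits = [label for label, content in options.items()
            if content.lower() in text.lower()]
    if len(hits) == 1:
        return hits[0]
    return None


def extract_choice(
    response: str,
    options: Dict[str, str],
    llm_extractor: Optional[Callable[[str, Dict[str, str]], str]] = None,
) -> Optional[str]:
    """Try rules first; fall back to the (hypothetical) LLM extractor if given."""
    choice = match_choice(response, options)
    if choice is None and llm_extractor is not None:
        candidate = llm_extractor(response, options).strip()
        # Accept the fallback answer only if it is a valid option label.
        choice = candidate if candidate in options else None
    return choice


def stub_llm(response: str, options: Dict[str, str]) -> str:
    """Stand-in for a real LLM call; always answers 'B' for this demo."""
    return "B"


options = {"A": "cat", "B": "dog"}
results = [
    extract_choice("(B).", options),
    extract_choice("The animal in the picture is a dog.", options),
    extract_choice("I would go with the second one.", options, stub_llm),
    extract_choice("I would go with the second one.", options),
]
print(results)
```

The last call returns `None` because neither rule fires and no fallback extractor is supplied; how such unresolved responses are scored is a design decision the sketch leaves open.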

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

MMBench: Is Your Multi-modal Model an All-around Player?

MultiMedEval: A Benchmark and a Toolkit for Evaluating Medical Vision-Language Models

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model

A Survey on Benchmarks of Multimodal Large Language Models

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

TinyLVLM-eHub: Towards Comprehensive and Efficient Evaluation for Large Vision-Language Models

VHELM: A Holistic Evaluation of Vision Language Models

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

Are We on the Right Way for Evaluating Large Vision-Language Models?

EVLM: An Efficient Vision-Language Model for Visual Understanding

HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites