VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models

Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Amit Agarwal, Zhe Chen, Mo Li, Yubo Ma, Hailong Sun, Xiangyu Zhao, Junbo Cui, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, Dahua Lin, Kai Chen
2024-09-12
Abstract: We present VLMEvalKit: an open-source toolkit for evaluating large multi-modality models based on PyTorch. The toolkit aims to provide a user-friendly and comprehensive framework for researchers and developers to evaluate existing multi-modality models and publish reproducible evaluation results. In VLMEvalKit, we implement over 70 different large multi-modality models, including both proprietary APIs and open-source models, as well as more than 20 different multi-modal benchmarks. By implementing a single interface, new models can be easily added to the toolkit, while the toolkit automatically handles the remaining workloads, including data preparation, distributed inference, prediction post-processing, and metric calculation. Although the toolkit is currently mainly used for evaluating large vision-language models, its design is compatible with future updates that incorporate additional modalities, such as audio and video. Based on the evaluation results obtained with the toolkit, we host OpenVLM Leaderboard, a comprehensive leaderboard to track the progress of multi-modality learning research. The toolkit is released at https://github.com/open-compass/VLMEvalKit and is actively maintained.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the difficulty of evaluating Large Multi-Modality Models (LMMs). Specifically:

1. **Evaluation Difficulty**: As large multi-modality models develop, comprehensive and detailed evaluation becomes increasingly important. For small research teams, however, evaluating these models across multiple benchmarks is daunting: it requires preparing data from different repositories and managing potential environment conflicts.
2. **Incomplete Results**: Benchmark authors may not report results for every LMM a user is interested in, so compiling a complete set of evaluation results takes significant effort.

To address these issues, the paper introduces VLMEvalKit, an open-source toolkit designed to simplify LMM evaluation. The toolkit supports over 70 large multi-modality models and more than 20 multi-modal benchmarks. A new model is integrated by implementing a single interface; the toolkit then automatically handles data preparation, distributed inference, prediction post-processing, and metric calculation. VLMEvalKit also employs a generative evaluation method, using large language models as choice extractors to mitigate the impact of response style on evaluation results, which improves the reliability and reproducibility of evaluations. Based on these results, the authors maintain the OpenVLM Leaderboard, a comprehensive leaderboard that tracks the progress of multi-modality learning research.
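
The two design ideas described above, a single model interface and choice extraction from free-form responses, can be illustrated with a minimal Python sketch. This is not VLMEvalKit's actual code or API: the names `BaseModel`, `generate`, `extract_choice`, `evaluate`, and the `llm_extractor` callable are all hypothetical, and the rule-based matching shown here is only an assumed first step before an LLM-based extractor would be consulted.

```python
import re
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Optional

# Hypothetical signature for an LLM-based extractor: (response, options) -> letter
Extractor = Callable[[str, Dict[str, str]], Optional[str]]


class BaseModel(ABC):
    """Hypothetical single interface: a new model only implements generate()."""

    @abstractmethod
    def generate(self, image_path: str, prompt: str) -> str:
        ...


class AlwaysBModel(BaseModel):
    """Toy stand-in for a real LMM; ignores its inputs and answers verbosely."""

    def generate(self, image_path: str, prompt: str) -> str:
        return "Looking at the image, the answer is (B)."


def extract_choice(response: str, options: Dict[str, str],
                   llm_extractor: Optional[Extractor] = None) -> Optional[str]:
    """Map a free-form response to an option letter.

    Cheap rules are tried first; the (assumed) LLM extractor is only
    consulted when the rules cannot settle on exactly one option.
    """
    letters = "".join(re.escape(k) for k in options)
    # Standalone option letters such as "B", "(B)" or "B."
    found = set(re.findall(rf"(?<![A-Za-z])([{letters}])(?![A-Za-z])", response))
    if len(found) == 1:
        return found.pop()
    # Otherwise, look for exactly one option whose text appears in the response.
    text_hits = [k for k, v in options.items() if v.lower() in response.lower()]
    if len(text_hits) == 1:
        return text_hits[0]
    if llm_extractor is not None:
        return llm_extractor(response, options)
    return None


def evaluate(model: BaseModel, samples: List[dict],
             llm_extractor: Optional[Extractor] = None) -> float:
    """Toolkit-side loop: build prompt, run inference, post-process, score."""
    correct = 0
    for s in samples:
        option_lines = "\n".join(f"{k}. {v}" for k, v in s["options"].items())
        response = model.generate(s["image"], s["question"] + "\n" + option_lines)
        pred = extract_choice(response, s["options"], llm_extractor)
        correct += int(pred == s["answer"])
    return correct / len(samples)


samples = [
    {"image": "img0.jpg", "question": "What animal is shown?",
     "options": {"A": "cat", "B": "dog"}, "answer": "B"},
    {"image": "img1.jpg", "question": "What color is the car?",
     "options": {"A": "red", "B": "blue", "C": "green"}, "answer": "C"},
]
accuracy = evaluate(AlwaysBModel(), samples)
```

In this sketch the model author writes only `generate`, while `evaluate` owns prompting, post-processing, and scoring, so a verbose answer such as "the answer is (B)" is scored the same as a bare "B".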