Abstract: Significant research efforts have been made to scale and improve vision-language model (VLM) training approaches. Yet, with an ever-growing number of benchmarks, researchers are tasked with the heavy burden of implementing each protocol, bearing a non-trivial computational cost, and making sense of how all these benchmarks translate into meaningful axes of progress. To facilitate a systematic evaluation of VLM progress, we introduce UniBench: a unified implementation of 50+ VLM benchmarks spanning a comprehensive range of carefully categorized capabilities from object recognition to spatial awareness, counting, and much more. We showcase the utility of UniBench for measuring progress by evaluating nearly 60 publicly available vision-language models, trained on scales of up to 12.8B samples. We find that while scaling training data or model size can boost many vision-language model capabilities, scaling offers little benefit for reasoning or relations. Surprisingly, we also discover today's best VLMs struggle on simple digit recognition and counting tasks, e.g. MNIST, which much simpler networks can solve. Where scale falls short, we find that more precise interventions, such as data quality or tailored-learning objectives, offer more promise. For practitioners, we also offer guidance on selecting a suitable VLM for a given application. Finally, we release an easy-to-run UniBench code-base with the full set of 50+ benchmarks and comparisons across 59 models as well as a distilled, representative set of benchmarks that runs in 5 minutes on a single GPU.

What problem does this paper attempt to address?

The paper addresses three problems in the evaluation of vision-language models (VLMs):

1. **Heavy Evaluation Burden**: The field has produced a large and growing number of benchmarks. Implementing each protocol is laborious for researchers, and running them all requires significant computational resources.
2. **Fragmented Evaluation Results**: Because each model is typically evaluated on only a subset of benchmarks, results are scattered, which makes it hard to form a complete picture of a model's strengths and weaknesses.
3. **Lack of a Systematic Evaluation Framework**: No unified, systematic framework exists for evaluating VLM capabilities, which makes it harder to tell where models have actually improved.

To address these issues, the paper proposes **UniBench**, a unified evaluation framework that implements more than 50 VLM benchmarks. The benchmarks cover capabilities ranging from object recognition to spatial awareness, counting, and more. They are organized into 7 benchmark types and 17 finer-grained capability categories, so that researchers can quickly identify where a model is strong or weak.

The paper also evaluates nearly 60 publicly available VLMs, trained on up to 12.8B samples, to study how model size and training data volume affect performance across tasks. It finds that scaling either one improves many capabilities but offers little benefit for reasoning and relation understanding. Surprisingly, even the best VLMs struggle on simple digit recognition and counting tasks such as MNIST, which much simpler networks can solve. Where scale falls short, the paper finds that more targeted interventions, such as improving data quality or using tailored learning objectives, are more promising.

Finally, the paper offers guidance on choosing a suitable VLM for a given application and releases an easy-to-run UniBench codebase. The release includes all 50+ benchmarks, comparison results for 59 models, and a distilled, representative benchmark set that runs in 5 minutes on a single GPU. The authors expect this release to support thorough, practical evaluation of VLM capabilities, making research progress easier to measure and pointing to promising strategies for advancing VLM research.
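
The idea of grouping benchmarks into capability categories can be illustrated with a minimal sketch. This is not the UniBench API: the function names, benchmark names, category labels, and scores below are all hypothetical placeholders, chosen only to show how per-benchmark accuracies might be averaged per capability so that a model's weakest area stands out.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_capability(
    scores: dict[str, float], mapping: dict[str, str]
) -> dict[str, float]:
    """Average per-benchmark scores within each capability category.

    `scores` maps benchmark name -> accuracy in [0, 1].
    `mapping` maps benchmark name -> capability label.
    Benchmarks missing from `mapping` are grouped under "uncategorized".
    """
    grouped: dict[str, list[float]] = defaultdict(list)
    for benchmark, score in scores.items():
        capability = mapping.get(benchmark, "uncategorized")
        grouped[capability].append(score)
    return {capability: mean(values) for capability, values in grouped.items()}


def weakest_capability(per_capability: dict[str, float]) -> str:
    """Return the capability label with the lowest mean score."""
    return min(per_capability, key=per_capability.get)


if __name__ == "__main__":
    # Hypothetical benchmark-to-capability mapping (not UniBench's taxonomy).
    mapping = {
        "bench_a": "object recognition",
        "bench_b": "object recognition",
        "bench_c": "counting",
        "bench_d": "relations",
    }
    # Made-up scores for an imaginary model, for illustration only.
    scores = {"bench_a": 0.5, "bench_b": 0.75, "bench_c": 0.25, "bench_d": 0.5}

    summary = aggregate_by_capability(scores, mapping)
    for capability, value in summary.items():
        print(f"{capability}: {value:.3f}")
    print("weakest:", weakest_capability(summary))
```

An unweighted mean treats every benchmark in a category equally, regardless of dataset size. That is one simple design choice among several; a real evaluation suite may weight or normalize scores differently.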

UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling
