UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling

Haider Al-Tahan, Quentin Garrido, Randall Balestriero, Diane Bouchacourt, Caner Hazirbas, Mark Ibrahim
2024-08-09
Abstract: Significant research efforts have been made to scale and improve vision-language model (VLM) training approaches. Yet, with an ever-growing number of benchmarks, researchers are tasked with the heavy burden of implementing each protocol, bearing a non-trivial computational cost, and making sense of how all these benchmarks translate into meaningful axes of progress. To facilitate a systematic evaluation of VLM progress, we introduce UniBench: a unified implementation of 50+ VLM benchmarks spanning a comprehensive range of carefully categorized capabilities from object recognition to spatial awareness, counting, and much more. We showcase the utility of UniBench for measuring progress by evaluating nearly 60 publicly available vision-language models, trained on scales of up to 12.8B samples. We find that while scaling training data or model size can boost many vision-language model capabilities, scaling offers little benefit for reasoning or relations. Surprisingly, we also discover today's best VLMs struggle on simple digit recognition and counting tasks, e.g. MNIST, which much simpler networks can solve. Where scale falls short, we find that more precise interventions, such as data quality or tailored-learning objectives offer more promise. For practitioners, we also offer guidance on selecting a suitable VLM for a given application. Finally, we release an easy-to-run UniBench code-base with the full set of 50+ benchmarks and comparisons across 59 models as well as a distilled, representative set of benchmarks that runs in 5 minutes on a single GPU.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses several problems in the evaluation of vision-language models (VLMs):

1. **Heavy Evaluation Burden**: As the field has grown, so has the number of evaluation benchmarks. Researchers must implement each protocol themselves and spend significant computational resources to run them.
2. **Fragmented Evaluation Results**: Because each model is typically evaluated on only a subset of benchmarks, results are fragmented, and it is difficult to form a complete picture of a model's strengths and weaknesses.
3. **Lack of a Systematic Evaluation Framework**: There is no unified, systematic framework for evaluating VLM capabilities, which makes it hard to tell which improvements in model performance are meaningful.

To address these issues, the paper proposes **UniBench**, a unified evaluation framework that implements more than 50 vision-language model benchmarks. The framework covers a wide range of capabilities, from object recognition to spatial awareness, counting, and more. The benchmarks are grouped into 7 types and 17 finer-grained capability categories, which helps researchers quickly identify where a model is strong and where it is weak.

The paper also conducts a large-scale evaluation of nearly 60 publicly available vision-language models, examining how model size and training data volume affect performance on different tasks. The study finds that although larger models and more data improve performance on many tasks, they offer little benefit for reasoning and relation understanding. Surprisingly, even the best VLMs struggle on simple digit recognition and counting tasks: on MNIST, for example, they perform much worse than far simpler neural networks.
Finally, the paper provides guidance on choosing an appropriate model for a given application and releases an easy-to-use UniBench codebase. The codebase includes all 50+ benchmarks and comparison results for 59 models, as well as a distilled, representative benchmark set that runs in 5 minutes on a single GPU. The contribution is intended to make thorough evaluation of VLM capabilities practical, so that research progress can be measured more reliably and promising strategies for advancing VLM research can be identified.
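
To make the idea of grouping benchmarks into capability categories concrete, here is a minimal Python sketch of how per-benchmark scores might be averaged into per-capability scores. It is not the UniBench API: the mapping `BENCHMARK_TO_CAPABILITY`, the function `aggregate_by_capability`, and all scores are hypothetical and chosen only for illustration. They do not reproduce the paper's taxonomy or results.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical benchmark -> capability mapping (illustrative only;
# not the actual UniBench categorization).
BENCHMARK_TO_CAPABILITY = {
    "imagenet1k": "object recognition",
    "cifar10": "object recognition",
    "mnist": "digit recognition",
    "countbench": "counting",
    "winoground": "relations",
}


def aggregate_by_capability(scores):
    """Average per-benchmark scores into one score per capability.

    `scores` maps benchmark name -> accuracy in [0, 1].
    Benchmarks missing from the mapping are skipped.
    """
    grouped = defaultdict(list)
    for benchmark, score in scores.items():
        capability = BENCHMARK_TO_CAPABILITY.get(benchmark)
        if capability is None:
            continue  # unknown benchmark: ignore rather than guess a category
        grouped[capability].append(score)
    return {capability: mean(values) for capability, values in grouped.items()}


# Made-up scores for a single hypothetical model (not results from the paper).
example_scores = {
    "imagenet1k": 0.75,
    "cifar10": 1.0,
    "mnist": 0.5,
    "countbench": 0.25,
    "winoground": 0.125,
    "unmapped_benchmark": 0.9,
}

per_capability = aggregate_by_capability(example_scores)
for capability, score in sorted(per_capability.items()):
    print(f"{capability}: {score:.3f}")
```

Reporting one number per capability, rather than one per benchmark, is what makes patterns such as "strong on object recognition, weak on counting" visible at a glance. A plain mean is only one possible aggregation and is used here for simplicity.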