VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, Lingpeng Kong, Qi Liu
2024-11-26
Abstract: Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems, yet their own evaluation remains under-explored. Current assessment methods primarily rely on AI-annotated preference labels from traditional VL tasks, which can introduce biases and often fail to effectively challenge state-of-the-art models. To address these limitations, we introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks. Through our AI-assisted annotation pipeline combining sample selection with human verification, we curate 1,250 high-quality examples specifically designed to probe model limitations. Comprehensive evaluation across 16 leading large vision-language models demonstrates VL-RewardBench's effectiveness as a challenging testbed, where even GPT-4o achieves only 65.4% accuracy, and state-of-the-art open-source models such as Qwen2-VL-72B struggle to surpass random guessing. Importantly, performance on VL-RewardBench strongly correlates (Pearson's r > 0.9) with MMMU-Pro accuracy using Best-of-N sampling with VL-GenRMs. Analysis experiments uncover three critical insights for improving VL-GenRMs: (i) models predominantly fail at basic visual perception tasks rather than reasoning tasks; (ii) inference-time scaling benefits vary dramatically by model capacity; and (iii) training VL-GenRMs to learn to judge substantially boosts judgment capability (+14.7% accuracy for a 7B VL-GenRM). We believe VL-RewardBench along with the experimental insights will become a valuable resource for advancing VL-GenRMs.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
The problem this paper addresses is the inadequacy of current evaluation methods for vision-language generative reward models (VL-GenRMs). Existing evaluations rely mainly on AI-annotated preference labels from traditional vision-language tasks, which can introduce bias and often fail to challenge state-of-the-art models. To overcome these limitations, the authors introduce VL-RewardBench, a comprehensive benchmark covering general multimodal queries, visual hallucination detection, and complex reasoning tasks. Using an AI-assisted annotation pipeline that combines sample selection with human verification, the authors curate 1,250 high-quality examples designed to probe model limitations.

### Main Problems and Solutions

1. **Limitations of Existing Evaluation Methods**:
   - **Bias Problem**: Current evaluation methods rely mainly on AI-annotated preference labels, which may introduce systematic bias.
   - **Lack of Challenge**: Existing methods usually use simple queries that do not capture the complex requirements of real-world applications, so they struggle to distinguish between rapidly evolving large vision-language models (LVLMs).
2. **Design Goals of VL-RewardBench**:
   - **Diversity**: Cover multiple real-world application scenarios.
   - **Difficulty**: Be difficult enough to expose the limitations of current models.
   - **Objectivity**: Provide objective ground-truth labels.

### Dataset Construction

The construction of VL-RewardBench has two main parts:

1. **Ensemble Filtering Strategy**: For general multimodal instructions and visual hallucination queries, an ensemble of small models is used to screen out challenging samples.
2. **AI-Assisted Preference Labeling**: For multimodal reasoning tasks that lack preference labels, an AI-assisted preference-labeling framework generates high-quality preference pairs.
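
The first construction step, using several small models together to screen for challenging samples, might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `PreferencePair` type, the `filter_challenging` helper, the judge names, and the `min_wrong` threshold rule are all hypothetical, and the data is a toy example.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    pair_id: str
    preferred: str  # ground-truth label: "A" or "B"


def filter_challenging(pairs, judgments, min_wrong):
    """Keep pairs that at least `min_wrong` judge models got wrong.

    `judgments` maps a (hypothetical) judge name to {pair_id: predicted label}.
    A missing prediction counts as wrong.
    """
    kept = []
    for pair in pairs:
        wrong = sum(
            1
            for preds in judgments.values()
            if preds.get(pair.pair_id) != pair.preferred
        )
        if wrong >= min_wrong:
            kept.append(pair)
    return kept


# Toy data, not taken from the benchmark.
pairs = [
    PreferencePair("p1", "A"),
    PreferencePair("p2", "B"),
    PreferencePair("p3", "A"),
]
judgments = {
    "small_model_1": {"p1": "A", "p2": "A", "p3": "B"},
    "small_model_2": {"p1": "A", "p2": "A", "p3": "A"},
    "small_model_3": {"p1": "B", "p2": "A", "p3": "B"},
}

hard = filter_challenging(pairs, judgments, min_wrong=2)
print([p.pair_id for p in hard])  # ['p2', 'p3']
```

Raising `min_wrong` makes the filter stricter: with `min_wrong=3`, only pairs that every judge gets wrong survive. The threshold actually used for the benchmark is not stated in this summary.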

### Experimental Results

Evaluating 16 leading VL-GenRMs on VL-RewardBench reveals significant performance gaps among current models. Even leading commercial models such as GPT-4o and Gemini-1.5-Pro reach accuracies of only 62.4% and 62.5%, while state-of-the-art open-source models such as Qwen2-VL-72B and Llama-3.2-90B perform close to the random-guessing level (43.0% and 53.9%). The study also reports several key insights:

- **Visual Perception Is the Main Bottleneck**: Models make significantly more errors on presence/recognition tasks than on reasoning tasks.
- **The Effect of Test-Time Scaling Varies with Model Capacity**: It benefits large models but may reduce the performance of small models.
- **Training VL-GenRMs for Judgment**: Training VL-GenRMs to learn to judge significantly improves judgment ability (for example, the accuracy of LLaVA-OneVision-7B-ov increases by 14.7%).

### Conclusion

As a benchmark, VL-RewardBench not only evaluates the reliability and effectiveness of current VL-GenRMs but also provides clear directions for future research.
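
To make the accuracy figures and the test-time finding above more concrete, the sketch below scores pairwise judgments and aggregates several sampled judgments per pair by majority vote. It assumes majority voting as the aggregation rule, which this summary does not confirm; the helper names `majority_vote` and `pairwise_accuracy` are hypothetical, and the votes are toy data rather than model outputs.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent label, or None if empty or tied."""
    ranked = Counter(votes).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def pairwise_accuracy(predictions, labels):
    """Fraction of pairs where the predicted preference matches the label."""
    correct = sum(pred == gold for pred, gold in zip(predictions, labels))
    return correct / len(labels)


# Toy data: K = 3 sampled judgments per preference pair.
labels = ["A", "B", "A", "B"]
sampled = [
    ["A", "B", "A"],
    ["A", "B", "B"],
    ["A", "A", "A"],
    ["B", "A", "B"],
]

# Single-sample baseline: use only the first judgment for each pair.
single = pairwise_accuracy([votes[0] for votes in sampled], labels)
# Aggregated: majority vote over all K judgments; a tie counts as incorrect.
voted = pairwise_accuracy([majority_vote(votes) for votes in sampled], labels)
print(single, voted)  # 0.75 1.0
```

The toy numbers only show the mechanics. Whether aggregation helps in practice depends on the judge: per the summary above, it benefits large models but may reduce the performance of small ones.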