Abstract: Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems, yet their own evaluation remains under-explored. Current assessment methods primarily rely on AI-annotated preference labels from traditional VL tasks, which can introduce biases and often fail to effectively challenge state-of-the-art models. To address these limitations, we introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks. Through our AI-assisted annotation pipeline combining sample selection with human verification, we curate 1,250 high-quality examples specifically designed to probe model limitations. Comprehensive evaluation across 16 leading large vision-language models demonstrates VL-RewardBench's effectiveness as a challenging testbed, where even GPT-4o achieves only 65.4% accuracy, and state-of-the-art open-source models such as Qwen2-VL-72B struggle to surpass random guessing. Importantly, performance on VL-RewardBench strongly correlates (Pearson's r > 0.9) with MMMU-Pro accuracy using Best-of-N sampling with VL-GenRMs. Analysis experiments uncover three critical insights for improving VL-GenRMs: (i) models predominantly fail at basic visual perception tasks rather than reasoning tasks; (ii) inference-time scaling benefits vary dramatically by model capacity; and (iii) training VL-GenRMs to learn to judge substantially boosts judgment capability (+14.7% accuracy for a 7B VL-GenRM). We believe VL-RewardBench along with the experimental insights will become a valuable resource for advancing VL-GenRMs.

What problem does this paper attempt to address?

This paper addresses the inadequacy of current evaluation methods for vision-language generative reward models (VL-GenRMs). Existing evaluation methods rely mainly on AI-annotated preference labels from traditional vision-language tasks, which can introduce bias and often fail to challenge state-of-the-art models. To overcome these limitations, the authors introduce VL-RewardBench, a comprehensive benchmark that covers general multimodal queries, visual hallucination detection, and complex reasoning tasks. Through an AI-assisted annotation pipeline that combines sample selection with human verification, the authors curate 1,250 high-quality examples designed to probe model limitations.

### Main Problems and Solutions

1. **Limitations of Existing Evaluation Methods**:
   - **Bias Problem**: Current evaluation methods rely mainly on AI-annotated preference labels, which may introduce systematic bias.
   - **Lack of Challenge**: Existing methods usually use simple queries that do not capture the complex requirements of real-world applications, so they struggle to distinguish between rapidly evolving large vision-language models (LVLMs).
2. **Design Goals of VL-RewardBench**:
   - **Diversity**: Cover multiple real-world application scenarios.
   - **Difficulty**: Be difficult enough to expose the limitations of current models.
   - **Objectivity**: Provide objective ground-truth labels.

### Dataset Construction

The dataset construction process of VL-RewardBench has two main parts:

1. **Ensemble Filtering Strategy**: For general multimodal instructions and visual hallucination queries, an ensemble of small models is used to select challenging samples.
2. **AI-Assisted Preference Labeling**: For multimodal reasoning tasks that lack preference labels, an AI-assisted preference-labeling framework generates high-quality preference pairs.
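
The filtering step described above could look roughly like the following. This is a minimal sketch, not the authors' pipeline: the `Sample` layout, the `filter_challenging` function, the `min_wrong` threshold, and the two toy judges are all hypothetical, and the rule "keep a sample only when every small judge picks the wrong response" is an assumed criterion chosen for illustration.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical layout: a sample holds two candidate responses and the index
# (0 or 1) of the preferred one. A judge returns the index it prefers.
Sample = Dict[str, object]
Judge = Callable[[Sample], int]


def filter_challenging(samples: List[Sample], judges: List[Judge],
                       min_wrong: Optional[int] = None) -> List[Sample]:
    """Keep samples that at least `min_wrong` judges get wrong.

    By default every judge must be wrong (an assumed, strict criterion).
    """
    if not judges:
        raise ValueError("at least one judge is required")
    threshold = len(judges) if min_wrong is None else min_wrong
    kept = []
    for sample in samples:
        # Count the judges whose choice disagrees with the preferred index.
        wrong = sum(1 for judge in judges if judge(sample) != sample["preferred"])
        if wrong >= threshold:
            kept.append(sample)
    return kept


# Toy stand-ins for small models, used only to make the sketch runnable.
def always_first(sample: Sample) -> int:
    return 0


def prefers_longer(sample: Sample) -> int:
    first, second = sample["responses"]
    return 0 if len(first) >= len(second) else 1


samples = [
    {"id": "a", "responses": ["short", "a much longer answer"], "preferred": 0},
    {"id": "b", "responses": ["a much longer answer", "short"], "preferred": 1},
    {"id": "c", "responses": ["tiny", "a much longer answer"], "preferred": 1},
]

hard = filter_challenging(samples, [always_first, prefers_longer])
print([s["id"] for s in hard])  # prints ['b']
```

Lowering `min_wrong` loosens the filter; with `min_wrong=1` all three toy samples are kept, because each one fools at least one judge.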
### Experimental Results

An evaluation of 16 leading VL-GenRMs on VL-RewardBench reveals significant performance gaps among current models. Even leading commercial models such as GPT-4o and Gemini-1.5-Pro reach only 62.4% and 62.5% accuracy, while state-of-the-art open-source models such as Qwen2-VL-72B and Llama-3.2-90B perform close to the random-guessing level (43.0% and 53.9%). The study also reports several key insights:

- **Visual Perception Is the Main Bottleneck**: The error rate of models on presence/recognition tasks is significantly higher than on reasoning tasks.
- **The Effect of Inference-Time Scaling Varies with Model Capacity**: It benefits large models but may reduce the performance of small models.
- **Training VL-GenRMs for Judgment**: Training VL-GenRMs to judge significantly improves their judgment ability (for example, the accuracy of the 7B model LLaVA-OneVision-7B-ov increases by 14.7%).

### Conclusion

VL-RewardBench is a valuable benchmark that not only evaluates the reliability and effectiveness of current VL-GenRMs but also points to clear directions for future research.
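
To make the accuracy figures above concrete, here is a minimal sketch of how pairwise-judgment accuracy might be computed per task category and overall. The record layout, the category labels, and the `accuracy_by_category` function are hypothetical, and the source does not say whether the reported numbers are averaged per sample or per category, so this sketch simply shows both views. With two candidate responses per sample, a judge that guesses uniformly at random is correct 50% of the time in expectation, which is the baseline the open-source results are compared against.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Expected accuracy of a uniformly random judge choosing between two responses.
RANDOM_BASELINE = 0.5

# Hypothetical record: (category, index the judge picked, preferred index).
Record = Tuple[str, int, int]


def accuracy_by_category(records: List[Record]) -> Dict[str, float]:
    """Return accuracy for each category plus an 'overall' per-sample accuracy."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for category, predicted, preferred in records:
        hit = int(predicted == preferred)
        # Update both the category bucket and the pooled 'overall' bucket.
        for key in (category, "overall"):
            totals[key] += 1
            correct[key] += hit
    return {key: correct[key] / totals[key] for key in totals}


records = [
    ("general", 0, 0),
    ("general", 1, 0),
    ("hallucination", 1, 1),
    ("reasoning", 0, 1),
]

scores = accuracy_by_category(records)
for name, value in scores.items():
    gap = value - RANDOM_BASELINE
    print(f"{name}: {value:.1%} ({gap:+.1%} vs. random guessing)")
```

Reporting the gap to the 50% baseline alongside raw accuracy makes it easier to see when a judge is barely better, or worse, than chance.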

VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling

WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences

VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models

Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning

Large Language Models as Automated Aligners for benchmarking Vision-Language Models

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples

MMBench: Is Your Multi-modal Model an All-around Player?

Calibrated Self-Rewarding Vision Language Models

AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?

RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data

Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

Benchmarking Sequential Visual Input Reasoning and Prediction in Multimodal Large Language Models

VLind-Bench: Measuring Language Priors in Large Vision-Language Models

Vision-Language Models as a Source of Rewards

RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback

What is the Visual Cognition Gap between Humans and Multimodal LLMs?

MVP-Bench: Can Large Vision-Language Models Conduct Multi-level Visual Perception Like Humans?

VRPTEST: Evaluating Visual Referring Prompting in Large Multimodal Models