Abstract: Large multimodal models (LMMs) with advanced video analysis capabilities have recently garnered significant attention. However, most evaluations rely on traditional methods like multiple-choice questions in benchmarks such as VideoMME and LongVideoBench, which often lack the depth needed to capture the complex demands of real-world users. To address this limitation, and because human annotation for video tasks is prohibitively costly and slow, we introduce VideoAutoArena, an arena-style benchmark inspired by LMSYS Chatbot Arena's framework, designed to automatically assess LMMs' video analysis abilities. VideoAutoArena utilizes user simulation to generate open-ended, adaptive questions that rigorously assess model performance in video understanding. The benchmark features an automated, scalable evaluation framework, incorporating a modified ELO Rating System for fair and continuous comparisons across multiple LMMs. To validate our automated judging system, we construct a 'gold standard' using a carefully curated subset of human annotations, demonstrating that our arena strongly aligns with human judgment while maintaining scalability. Additionally, we introduce a fault-driven evolution strategy, progressively increasing question complexity to push models toward handling more challenging video analysis scenarios. Experimental results demonstrate that VideoAutoArena effectively differentiates among state-of-the-art LMMs, providing insights into model strengths and areas for improvement. To further streamline our evaluation, we introduce VideoAutoBench as an auxiliary benchmark, where human annotators label winners in a subset of VideoAutoArena battles. We use GPT-4o as a judge to compare responses against these human-validated answers. Together, VideoAutoArena and VideoAutoBench offer a cost-effective and scalable framework for evaluating LMMs in user-centric video analysis.

What problem does this paper attempt to address?

This paper addresses the limitations of existing methods for evaluating the video analysis capabilities of large multimodal models (LMMs). Specifically:

1. **Limitations of Existing Benchmarks**:
   - Most existing video-analysis benchmarks rely on traditional evaluation methods, such as multiple-choice questions, which do not capture the complex needs of real-world users.
   - These benchmarks usually predefine core video-analysis skills, such as object recognition in a single frame and action reasoning across frames. This capability-centered approach ignores the diverse questions that users may actually ask.
   - Human annotation of video tasks is expensive and time-consuming, which limits how far high-quality question generation can scale.
2. **Proposed New Method**:
   - The paper introduces **VideoAutoArena**, an automated arena-style benchmarking framework that evaluates the video-understanding capabilities of LMMs by generating open-ended, adaptive questions through user simulation.
   - VideoAutoArena uses state-of-the-art LMMs as agents for user simulation and preference selection, removing the need for expensive human annotation and making evaluation efficient and scalable.
   - The framework integrates a fault-driven evolution strategy that generates increasingly complex and challenging questions based on model performance, creating a more rigorous testing environment.
3. **Objectives**:
   - Bridge the gap between capability-centered evaluation and the requirements of practical applications, providing an evaluation method that is closer to real user interactions.
   - Provide insights for LMM development through an automated evaluation and ranking system, helping to identify model strengths and areas for improvement.

In summary, this paper aims to address the shortcomings of existing video-analysis benchmarks with respect to real user needs and scalability by introducing VideoAutoArena, a more comprehensive and efficient method for evaluating LMMs.
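
The abstract mentions a modified ELO Rating System for ranking models from pairwise battles, but this page does not describe the modification. As a rough illustration only, the sketch below implements the standard, unmodified Elo update. The function names, the K-factor of 32, the initial rating of 1000, and the model names are all assumptions chosen for the example, not values from the paper.

```python
from typing import Dict, Iterable, Tuple

# A battle is (model_a, model_b, outcome), where outcome is the score of
# model_a: 1.0 for a win, 0.0 for a loss, 0.5 for a tie.
Battle = Tuple[str, str, float]


def expected_score(r_a: float, r_b: float, scale: float = 400.0) -> float:
    """Standard Elo expected score of a player rated r_a against r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))


def update_elo(
    ratings: Dict[str, float],
    battles: Iterable[Battle],
    k: float = 32.0,         # assumed K-factor, not taken from the paper
    initial: float = 1000.0  # assumed starting rating, not taken from the paper
) -> Dict[str, float]:
    """Apply the standard Elo update sequentially over a list of battles."""
    ratings = dict(ratings)  # work on a copy; leave the caller's dict untouched
    for model_a, model_b, outcome in battles:
        r_a = ratings.setdefault(model_a, initial)
        r_b = ratings.setdefault(model_b, initial)
        e_a = expected_score(r_a, r_b)
        # Each model moves by k * (actual score - expected score).
        ratings[model_a] = r_a + k * (outcome - e_a)
        ratings[model_b] = r_b + k * ((1.0 - outcome) - (1.0 - e_a))
    return ratings


if __name__ == "__main__":
    # Hypothetical model names and judge verdicts, for illustration only.
    battles = [
        ("model_x", "model_y", 1.0),
        ("model_y", "model_z", 0.5),
        ("model_x", "model_z", 0.0),
    ]
    final = update_elo({}, battles)
    for name, rating in sorted(final.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {rating:.1f}")
```

In an arena setting, the outcome of each battle would come from the automated judge's preference between two model responses. Because sequential Elo updates depend on battle order, leaderboards of this kind often average over shuffled orderings or fit the ratings jointly; whether VideoAutoArena does either is not stated here.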

VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation
