VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation

Ziyang Luo, Haoning Wu, Dongxu Li, Jing Ma, Mohan Kankanhalli, Junnan Li
2024-11-20
Abstract: Large multimodal models (LMMs) with advanced video analysis capabilities have recently garnered significant attention. However, most evaluations rely on traditional methods like multiple-choice questions in benchmarks such as VideoMME and LongVideoBench, which often lack the depth needed to capture the complex demands of real-world users. To address this limitation, and because human annotation for video tasks is prohibitively costly and slow, we introduce VideoAutoArena, an arena-style benchmark inspired by LMSYS Chatbot Arena's framework, designed to automatically assess LMMs' video analysis abilities. VideoAutoArena utilizes user simulation to generate open-ended, adaptive questions that rigorously assess model performance in video understanding. The benchmark features an automated, scalable evaluation framework, incorporating a modified ELO Rating System for fair and continuous comparisons across multiple LMMs. To validate our automated judging system, we construct a 'gold standard' using a carefully curated subset of human annotations, demonstrating that our arena strongly aligns with human judgment while maintaining scalability. Additionally, we introduce a fault-driven evolution strategy, progressively increasing question complexity to push models toward handling more challenging video analysis scenarios. Experimental results demonstrate that VideoAutoArena effectively differentiates among state-of-the-art LMMs, providing insights into model strengths and areas for improvement. To further streamline our evaluation, we introduce VideoAutoBench as an auxiliary benchmark, where human annotators label winners in a subset of VideoAutoArena battles. We use GPT-4o as a judge to compare responses against these human-validated answers. Together, VideoAutoArena and VideoAutoBench offer a cost-effective and scalable framework for evaluating LMMs in user-centric video analysis.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Multimedia
What problem does this paper attempt to address?
The paper addresses the limitations of existing methods for evaluating the video analysis capabilities of large multimodal models (LMMs). Specifically:

1. **Limitations of Existing Benchmarks**:
   - Most existing video analysis benchmarks rely on traditional evaluation methods, such as multiple-choice questions, which fail to capture the complex needs of real-world users.
   - These benchmarks usually predefine core video analysis skills, such as object recognition in a single frame and action reasoning across frames. This capability-centered approach ignores the diverse questions that users may actually ask.
   - Human annotation of video tasks is costly and time-consuming, which limits the scalability of generating high-quality questions.
2. **Proposed New Method**:
   - The paper introduces **VideoAutoArena**, an automated arena-style benchmarking framework that evaluates the video understanding capabilities of LMMs by generating open-ended, adaptive questions through user simulation.
   - VideoAutoArena uses state-of-the-art LMMs as agents for user simulation and preference selection, eliminating the need for expensive human annotation and making evaluation efficient and scalable.
   - The framework integrates a fault-driven hard-prompt evolution strategy, which generates increasingly complex and challenging questions based on model performance to ensure a more rigorous testing environment.
3. **Objectives**:
   - Bridge the gap between capability-centered evaluation and the requirements of practical applications, providing an evaluation method that is closer to real user interactions.
   - Provide insights for the development of LMMs through an automated evaluation and ranking system, helping to identify model strengths and directions for improvement.

In summary, the paper aims to address the shortcomings of existing video analysis benchmarks in reflecting real user needs and in scalability by introducing VideoAutoArena, a more comprehensive and efficient method for evaluating LMMs.
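The abstract states that battle outcomes are aggregated with a modified ELO Rating System, but the modification itself is not described in this summary. As a point of reference only, the following is a minimal sketch of the standard Elo update applied to pairwise battles. The function names, the K-factor of 32, the initial rating of 1000, and the model names are illustrative assumptions, not values taken from the paper.

```python
from typing import Dict, Iterable, Tuple

# One battle: (model_a, model_b, score_a), where score_a is
# 1.0 if A wins, 0.0 if B wins, and 0.5 for a tie.
Battle = Tuple[str, str, float]


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def update_elo(
    ratings: Dict[str, float],
    battles: Iterable[Battle],
    k: float = 32.0,         # assumed K-factor, not from the paper
    initial: float = 1000.0  # assumed starting rating, not from the paper
) -> Dict[str, float]:
    """Apply the standard (unmodified) Elo update to a sequence of battles."""
    updated = dict(ratings)  # leave the caller's dict untouched
    for model_a, model_b, score_a in battles:
        r_a = updated.get(model_a, initial)
        r_b = updated.get(model_b, initial)
        e_a = expected_score(r_a, r_b)
        # The two updates are equal and opposite, so total rating is conserved.
        updated[model_a] = r_a + k * (score_a - e_a)
        updated[model_b] = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return updated


if __name__ == "__main__":
    # Hypothetical judged battles between placeholder model names.
    battles = [("model_x", "model_y", 1.0), ("model_y", "model_z", 0.5)]
    print(update_elo({}, battles))
```

In an arena setting, each `score_a` would come from the automated judge's preference between two model responses to the same simulated user question. Sequential Elo depends on the order of battles, which is one reason arena-style leaderboards often adjust the basic scheme.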