Abstract: Large multimodal models (LMMs) with advanced video analysis capabilities have recently garnered significant attention. However, most evaluations rely on traditional methods such as multiple-choice questions in benchmarks like VideoMME and LongVideoBench, which often lack the depth needed to capture the complex demands of real-world users. To address this limitation, and because human annotation for video tasks is prohibitively costly and slow, we introduce VideoAutoArena, an arena-style benchmark inspired by LMSYS Chatbot Arena's framework, designed to automatically assess LMMs' video analysis abilities. VideoAutoArena uses user simulation to generate open-ended, adaptive questions that rigorously assess model performance in video understanding. The benchmark features an automated, scalable evaluation framework that incorporates a modified ELO Rating System for fair and continuous comparisons across multiple LMMs. To validate our automated judging system, we construct a 'gold standard' from a carefully curated subset of human annotations, demonstrating that our arena aligns strongly with human judgment while maintaining scalability. Additionally, we introduce a fault-driven evolution strategy that progressively increases question complexity to push models toward handling more challenging video analysis scenarios. Experimental results demonstrate that VideoAutoArena effectively differentiates among state-of-the-art LMMs, providing insights into model strengths and areas for improvement. To further streamline our evaluation, we introduce VideoAutoBench as an auxiliary benchmark, where human annotators label winners in a subset of VideoAutoArena battles. We use GPT-4o as a judge to compare responses against these human-validated answers. Together, VideoAutoArena and VideoAutoBench offer a cost-effective and scalable framework for evaluating LMMs in user-centric video analysis.
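
To make the arena-style ranking concrete, the following is a minimal sketch of a standard Elo update applied to pairwise battles. It is not the modified rating system used by VideoAutoArena, whose details the abstract does not give. The function names, the model names, the K-factor of 32, and the initial rating of 1000 are all illustrative assumptions.

```python
# Minimal sketch of a *standard* Elo update for pairwise model battles.
# Assumption: this is the textbook formulation, not the paper's modified system.
# All names and constants here (k=32, initial=1000, model_x, ...) are hypothetical.


def expected_score(rating_a, rating_b, scale=400.0):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def update_elo(ratings, model_a, model_b, outcome, k=32.0, initial=1000.0):
    """Update ratings in place after one battle.

    outcome is A's score: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    """
    ra = ratings.get(model_a, initial)
    rb = ratings.get(model_b, initial)
    ea = expected_score(ra, rb)
    # A gains what B loses, so the total rating mass is conserved.
    ratings[model_a] = ra + k * (outcome - ea)
    ratings[model_b] = rb + k * ((1.0 - outcome) - (1.0 - ea))
    return ratings


# Hypothetical battle log: (model A, model B, A's score as decided by a judge).
battles = [
    ("model_x", "model_y", 1.0),
    ("model_y", "model_z", 0.5),
    ("model_x", "model_z", 1.0),
]

ratings = {}
for a, b, outcome in battles:
    update_elo(ratings, a, b, outcome)

# Print a leaderboard, highest rating first.
for name, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.1f}")
```

In an automated arena, the `outcome` value would come from the judge model's verdict on each pair of responses rather than from a hand-written log.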

Auto-Arena: Automating LLM Evaluations with Agent Peer Battles and Committee Discussions

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference

Arena Learning: Build Data Flywheel for LLMs Post-training via Simulated Chatbot Arena

VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation

Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

Can LLMs Beat Humans in Debating? A Dynamic Multi-agent Framework for Competitive Debate

Revisiting Benchmark and Assessment: An Agent-based Exploratory Dynamic Evaluation Framework for LLMs

LegalAgentBench: Evaluating LLM Agents in Legal Domain

ResearchArena: Benchmarking LLMs' Ability to Collect and Organize Information as Research Agents

Constructing Domain-Specific Evaluation Sets for LLM-as-a-judge

Put Your Money Where Your Mouth Is: Evaluating Strategic Planning and Execution of LLM Agents in an Auction Arena

AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents