Abstract: Understanding long videos, ranging from tens of minutes to several hours, presents unique challenges in video comprehension. Despite the increasing importance of long-form video content, existing benchmarks primarily focus on shorter clips. To address this gap, we introduce InfiniBench, a comprehensive benchmark for very long video understanding, which presents 1) the longest video duration, averaging 52.59 minutes per video; 2) the largest number of question-answer pairs, 108.2K; 3) diversity in questions that examine nine different skills and include both multiple-choice and open-ended questions; and 4) a human-centric design, as the video sources come from movies and daily TV shows, with specific human-level question designs such as Movie Spoiler Questions that require critical thinking and comprehensive understanding. Using InfiniBench, we comprehensively evaluate existing Large Multi-Modality Models (LMMs) on each skill, including commercial models such as GPT-4o and Gemini 1.5 Flash as well as open-source models. The evaluation shows that our benchmark poses significant challenges: even leading AI models like GPT-4o and Gemini 1.5 Flash struggle to achieve high performance in long video understanding, with average accuracies of just 49.16% and 42.72%, and average scores of 3.22 and 2.71 out of 5, respectively. We hope this benchmark will stimulate the LMMs community towards long video and human-level understanding. Our benchmark can be accessed at https://vision-cair.github.io/InfiniBench/

What problem does this paper attempt to address?

The problem this paper addresses is that existing video understanding benchmarks focus mainly on short clips and do not evaluate understanding of long videos (ranging from tens of minutes to several hours). Long video understanding not only increases the number of frames to process but also involves more comprehensive information, making it a key task for advancing artificial intelligence towards human-level capabilities. Specifically, the paper identifies the following challenges:

1. **Video Length**: Existing benchmarks primarily focus on shorter video clips, while long video understanding requires handling content over a much longer duration.
2. **Diversity of Questions**: The question types in existing benchmarks are relatively homogeneous and do not comprehensively evaluate a range of skills.
3. **Human-Level Understanding**: Existing models perform poorly when processing long videos, especially when deep understanding of events or characters is required.

To address these challenges, the authors propose **InfiniBench**, a comprehensive long video understanding benchmark with the following features:

- **Longest Video Duration**: The average duration of each video is 52.59 minutes.
- **Most Question-Answer Pairs**: It includes 108.2K (about 108,200) question-answer pairs.
- **Diverse Skill Evaluation**: It covers nine different skills: global appearance, scene transitions, summarization, action sequences of each character, temporal order, event linking, deep contextual understanding, movie spoiler questions, and local visual and contextual questions.
- **Human-Centric Design**: The video sources are movies and daily TV series, with specifically designed human-level questions, such as movie spoiler questions, that require critical thinking and comprehensive understanding.

By introducing **InfiniBench**, the authors aim to fill the gap in large-scale long video understanding benchmarks, promote the development of current open-source multimodal models, and push multimodal models towards human-level long video understanding.

InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding
