InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding

Kirolos Ataallah, Chenhui Gou, Eslam Abdelrahman, Khushbu Pahwa, Jian Ding, Mohamed Elhoseiny
2024-08-31
Abstract: Understanding long videos, ranging from tens of minutes to several hours, presents unique challenges in video comprehension. Despite the increasing importance of long-form video content, existing benchmarks primarily focus on shorter clips. To address this gap, we introduce InfiniBench, a comprehensive benchmark for very long video understanding, which presents 1) the longest video duration, averaging 52.59 minutes per video; 2) the largest number of question-answer pairs, 108.2K; 3) diversity in questions that examine nine different skills and include both multiple-choice and open-ended questions; and 4) a human-centric design, as the video sources come from movies and daily TV shows, with specific human-level question designs such as Movie Spoiler Questions that require critical thinking and comprehensive understanding. Using InfiniBench, we comprehensively evaluate existing Large Multi-Modality Models (LMMs) on each skill, including commercial models such as GPT-4o and Gemini 1.5 Flash as well as open-source models. The evaluation shows that our benchmark poses significant challenges. Our findings reveal that even leading AI models like GPT-4o and Gemini 1.5 Flash face challenges in achieving high performance in long video understanding, with average accuracies of just 49.16% and 42.72%, and average scores of 3.22 and 2.71 out of 5, respectively. We hope this benchmark will stimulate the LMMs community towards long video and human-level understanding. Our benchmark can be accessed at https://vision-cair.github.io/InfiniBench/
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses a gap in video understanding benchmarks: existing benchmarks mainly focus on short clips and do not evaluate understanding of long videos (ranging from tens of minutes to several hours). A long video not only contains more frames but also carries more comprehensive information, which makes long video understanding a key task for advancing artificial intelligence toward human-level capabilities. Specifically, the paper identifies the following challenges:
1. **Video Length**: Existing benchmarks primarily focus on shorter video clips, while long video understanding requires handling content over a longer duration.
2. **Diversity of Questions**: The types of questions in existing benchmarks are relatively homogeneous, lacking comprehensive evaluation of various skills.
3. **Human-Level Understanding**: Existing models perform poorly when processing long videos, especially when deep understanding of events or characters is required.

To address these challenges, the authors propose **InfiniBench**, a comprehensive long video understanding benchmark with the following features:
- **Longest Video Duration**: The average duration of each video is 52.59 minutes.
- **Most Question-Answer Pairs**: It includes 108.2K (about 108,200) question-answer pairs.
- **Diverse Skill Evaluation**: It covers nine different skills: global appearance, scene transitions, summarization, action sequences of each character, temporal order, event linking, deep contextual understanding, movie spoiler questions, and local visual and contextual questions.
- **Human-Centric Design**: The video sources are movies and daily TV series, with specifically designed human-level questions, such as movie spoiler questions, that require critical thinking and comprehensive understanding.
By introducing **InfiniBench**, the authors aim to fill the gap in large-scale long video understanding benchmarks, promote the development of current open-source multimodal models, and push multimodal models towards human-level long video understanding.
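The abstract reports two kinds of headline numbers: an average accuracy (in percent) and an average score out of 5. The sketch below shows one way such numbers could be aggregated from per-question results. It is an illustration under stated assumptions, not the authors' evaluation code: the record fields (`skill`, `kind`, `value`), the mapping of accuracy to multiple-choice questions and the 0-5 score to open-ended questions, and the choice to average per skill first and then across skills are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def summarize(results):
    """Aggregate per-question results into two headline metrics.

    Each record is a dict with hypothetical keys:
      skill: name of the skill being tested
      kind:  "mcq" (value is True/False for correct/incorrect)
             or "open" (value is a score between 0 and 5)
    Assumption: metrics are averaged per skill, then across skills.
    """
    per_skill = defaultdict(list)
    for r in results:
        per_skill[(r["kind"], r["skill"])].append(float(r["value"]))

    # Per-skill accuracy in percent for multiple-choice skills.
    accuracies = [100.0 * mean(v) for (kind, _), v in per_skill.items() if kind == "mcq"]
    # Per-skill mean score (0-5) for open-ended skills.
    scores = [mean(v) for (kind, _), v in per_skill.items() if kind == "open"]

    return {
        "avg_accuracy_pct": mean(accuracies) if accuracies else None,
        "avg_score_out_of_5": mean(scores) if scores else None,
    }


# Toy records, invented purely for illustration.
results = [
    {"skill": "temporal order", "kind": "mcq", "value": True},
    {"skill": "temporal order", "kind": "mcq", "value": False},
    {"skill": "scene transitions", "kind": "mcq", "value": True},
    {"skill": "summarization", "kind": "open", "value": 3.0},
    {"skill": "summarization", "kind": "open", "value": 4.0},
]
summary = summarize(results)
print(summary)
```

Averaging per skill before averaging across skills keeps a skill with many questions from dominating the headline number; a benchmark could equally pool all questions, which would give different values.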