Abstract: Large vision-language models (LVLMs) have made significant strides in addressing complex video tasks, sparking researchers' interest in their human-like multimodal understanding capabilities. Video description serves as a fundamental task for evaluating video comprehension, necessitating a deep understanding of spatial and temporal dynamics, which presents challenges for both humans and machines. Thus, investigating whether LVLMs can describe videos as comprehensively as humans (through reasonable human-machine comparisons using video captioning as a proxy task) will enhance our understanding and application of these models. However, current benchmarks for video comprehension have notable limitations, including short video durations, brief annotations, and reliance on a single annotator's perspective. These factors hinder a comprehensive assessment of LVLMs' ability to understand complex, lengthy videos and prevent the establishment of a robust human baseline that accurately reflects human video comprehension capabilities. To address these issues, we propose a novel benchmark, FIOVA (Five In One Video Annotations), designed to evaluate the differences between LVLMs and human understanding more comprehensively. FIOVA includes 3,002 long video sequences (averaging 33.6 seconds) that cover diverse scenarios with complex spatiotemporal relationships. Each video is annotated by five distinct annotators, capturing a wide range of perspectives and resulting in captions that are 4-15 times longer than existing benchmarks, thereby establishing a robust baseline that represents human understanding comprehensively for the first time in video description tasks. Using the FIOVA benchmark, we conducted an in-depth evaluation of six state-of-the-art LVLMs, comparing their performance with humans. More detailed information can be found at https://huuuuusy.github.io/fiova/.

What problem does this paper attempt to address?

This paper asks whether large vision-language models (LVLMs) can describe videos as comprehensively as humans do. The authors point out that current benchmarks for video understanding have significant limitations: short video durations, brief annotations, and reliance on a single annotator's perspective. These factors limit how well a model's understanding of complex, long videos can be assessed, and they prevent the establishment of a reliable baseline that accurately reflects human video understanding. To address these issues, the authors propose a new benchmark, FIOVA (Five In One Video Annotations), which aims to evaluate the differences between LVLM and human understanding more comprehensively. FIOVA includes 3,002 long video sequences (with an average duration of 33.6 seconds) covering diverse scenes and complex spatiotemporal relationships. Each video is annotated by five different annotators, producing descriptions that are 4 to 15 times longer than those in existing benchmarks and thereby establishing a robust baseline that comprehensively represents human understanding. Using this benchmark, the authors conducted an in-depth evaluation of six state-of-the-art LVLMs, comparing their output with the human annotations across various aspects of video understanding. The results indicate that although current LVLMs have made some progress in certain perceptual and reasoning abilities, they still omit information and produce descriptions that lack depth. The authors also found significant differences between LVLMs and human annotators on complex videos, especially where the human annotators disagreed with one another; in those cases LVLMs tended to apply a uniform strategy to challenging content.
In summary, this paper aims to bridge the gap between LVLM and human video understanding by proposing the FIOVA benchmark, which offers insights and guidance for future research and development.
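
The summary above singles out videos on which the five human annotators disagree. As a rough illustration of how such disagreement could be quantified, the following Python sketch computes the mean pairwise Jaccard overlap of word sets across a video's captions. This is an assumption made for illustration only: the text above does not say how FIOVA measures annotator consistency, and the function names (tokens, jaccard, annotator_agreement) and the sample captions are hypothetical.

```python
from itertools import combinations


def tokens(caption: str) -> set[str]:
    """Lowercase a caption and return its set of punctuation-stripped words."""
    words = {w.strip(".,;:!?\"'()") for w in caption.lower().split()}
    return words - {""}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard overlap |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def annotator_agreement(captions: list[str]) -> float:
    """Mean pairwise Jaccard overlap over all pairs of captions.

    Hypothetical agreement score, not the benchmark's own metric:
    values near 1 mean the annotators used similar vocabulary,
    values near 0 mean their descriptions barely overlap.
    """
    sets = [tokens(c) for c in captions]
    pairs = list(combinations(sets, 2))
    if not pairs:
        raise ValueError("need at least two captions")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Made-up captions standing in for five annotators of one video.
    five_captions = [
        "A man walks a dog along the beach.",
        "A man and his dog walk on the beach at sunset.",
        "Someone is walking a dog by the sea.",
        "A dog runs ahead of a man on the sand.",
        "A person strolls along the shore with a dog.",
    ]
    print(f"agreement: {annotator_agreement(five_captions):.3f}")
```

Word overlap is a crude proxy, since two captions can describe the same event in different words; a real analysis would more likely compare the events or entities mentioned, or use a semantic similarity measure.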

Can LVLMs Describe Videos like Humans? A Five-in-One Video Annotations Benchmark for Better Human-Machine Comparison
