TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models

Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, Yao Dou, Jaden Park, Jianfeng Gao, Yong Jae Lee, Jianwei Yang
2024-10-16
Abstract: Understanding fine-grained temporal dynamics is crucial for multimodal video comprehension and generation. Due to the lack of fine-grained temporal annotations, existing video benchmarks mostly resemble static image benchmarks and are ill-suited to evaluating models' temporal understanding. In this paper, we introduce TemporalBench, a new benchmark dedicated to evaluating fine-grained temporal understanding in videos. TemporalBench consists of ~10K video question-answer pairs, derived from ~2K high-quality human annotations detailing the temporal dynamics in video clips. As a result, our benchmark provides a unique testbed for evaluating various temporal understanding and reasoning abilities such as action frequency, motion magnitude, event order, etc. Moreover, it enables evaluation across tasks (video question answering and captioning), video lengths (short and long video understanding), and model types (multimodal video embedding models and text generation models). Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench, demonstrating a significant gap (~30%) between humans and AI in temporal understanding. Furthermore, we identify a critical pitfall in multi-choice QA: LLMs can detect the subtle changes in negative captions and use the resulting centralized description as a cue for their prediction. We propose Multiple Binary Accuracy (MBA) to correct this bias. We hope that TemporalBench can foster research on improving models' temporal reasoning capabilities. Both dataset and evaluation code will be made available.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper addresses a gap in video understanding: existing multimodal video benchmarks mostly resemble static image benchmarks and do not assess fine-grained temporal dynamics. Because they lack fine-grained temporal annotations, these benchmarks are poor tools for measuring a model's temporal understanding. The paper proposes a new benchmark, TemporalBench, for evaluating how well multimodal video models understand fine-grained activities in videos. The benchmark contains approximately 10,000 question-answer pairs, derived from around 2,000 high-quality human annotations that detail the temporal dynamics in video clips. The paper also points out a critical pitfall in multiple-choice evaluation: large language models can detect the subtle changes in the negative captions, find a "centralized" description among the options, and use it as a prediction cue without relying on the video. To correct this bias, the authors propose a metric called Multiple Binary Accuracy (MBA). The results indicate that even the most advanced models, such as GPT-4o, reach only 38.5% question answering accuracy on TemporalBench, a significant gap (about 30%) compared to humans, especially in understanding long-duration videos. Current models therefore remain limited in understanding the fine-grained temporal relationships of objects and events in videos.
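The text above names Multiple Binary Accuracy but does not define it. The following is a minimal sketch, assuming that each multiple-choice question is split into one binary question per negative caption (correct caption vs. that negative) and that a video counts as correct only if every one of its binary questions is answered correctly. That definition, the function names, and the `(video_id, is_correct)` input format are assumptions for illustration, not the authors' released evaluation code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One record per binary question: (video_id, whether the model chose the correct caption).
# This record format is hypothetical, chosen only for the sketch.
BinaryResult = Tuple[str, bool]


def binary_accuracy(results: Iterable[BinaryResult]) -> float:
    """Fraction of individual binary questions answered correctly."""
    outcomes = [correct for _, correct in results]
    if not outcomes:
        raise ValueError("no results to score")
    return sum(outcomes) / len(outcomes)


def multiple_binary_accuracy(results: Iterable[BinaryResult]) -> float:
    """Fraction of videos whose binary questions are ALL answered correctly.

    Assumed definition: a single wrong binary answer marks the whole video
    as incorrect, so guessing or exploiting option wording is penalized.
    """
    per_video: Dict[str, List[bool]] = defaultdict(list)
    for video_id, correct in results:
        per_video[video_id].append(correct)
    if not per_video:
        raise ValueError("no results to score")
    return sum(all(outcomes) for outcomes in per_video.values()) / len(per_video)


if __name__ == "__main__":
    example = [
        ("v1", True), ("v1", True), ("v1", False),  # one miss fails all of v1
        ("v2", True), ("v2", True),                 # v2 fully correct
    ]
    print("binary accuracy:", binary_accuracy(example))
    print("multiple binary accuracy:", multiple_binary_accuracy(example))
```

Under this assumed definition the stricter score can never exceed plain binary accuracy, and a model cannot answer by comparing all options at once because each binary question shows only two captions.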