Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs

Zijia Zhao, Haoyu Lu, Yuqi Huo, Yifan Du, Tongtian Yue, Longteng Guo, Bingning Wang, Weipeng Chen, Jing Liu
2024-10-24
Abstract: Video understanding is a crucial next step for multimodal large language models (MLLMs). Various benchmarks have been introduced to better evaluate MLLMs. Nevertheless, current video benchmarks are still inefficient for evaluating video models during iterative development, due to the high cost of constructing datasets and the difficulty of isolating specific skills. In this paper, we propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework based on synthetic video generation. VideoNIAH decouples video content from its query-responses by inserting unrelated visual 'needles' into original videos. The framework automates the generation of query-response pairs using predefined rules, minimizing manual labor. The queries focus on specific aspects of video understanding, enabling more skill-specific evaluations. The separation between video content and the queries also allows for increased video variety and evaluations across different lengths. Utilizing VideoNIAH, we compile a video benchmark, VNBench, which includes tasks such as retrieval, ordering, and counting to evaluate three key aspects of video understanding: temporal perception, chronological ordering, and spatio-temporal coherence. We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities across various tasks. Additionally, we perform an in-depth analysis of the test results and model configurations. Based on these findings, we provide some advice for improving video MLLM training, offering valuable insights to guide future research and model development. The code and data are available at https://github.com/joez17/VideoNIAH.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the evaluation challenges that video understanding models face during iterative development. Specifically, existing video benchmarks have two major problems:

1. **High cost of constructing high-quality datasets**: Creating real-world video datasets requires substantial human effort and time, including video selection, annotation, and filtering. This makes dataset construction complex and time-consuming.
2. **Difficulty in isolating specific skills**: Existing video benchmarks often evaluate multiple aspects of video understanding simultaneously, making it difficult to pinpoint a model's weakness in any one skill. For example, a single video question may require OCR (Optical Character Recognition), object detection, vocabulary knowledge, temporal order, and temporal reasoning at the same time.

To solve these problems, the authors propose a framework named **VideoNIAH (Video Needle In A Haystack)**. The framework uses synthetic video generation to insert unrelated visual "needles" into original videos, thereby decoupling the video content from its query-response pairs. Its main features are as follows:

- **Automatic generation of query-response pairs**: Query-response pairs are generated from predefined rules, reducing manual labor.
- **Focus on specific aspects**: Queries target specific aspects of video understanding, such as temporal perception, chronological ordering, and spatio-temporal coherence, making the evaluation more skill-specific.
- **Increased video diversity**: Because the video content is decoupled from the queries, a wider variety of source videos and video lengths can be used, increasing the flexibility of evaluation.
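The needle-insertion idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: frames are symbolic strings rather than images, and every name here (`Sample`, `insert_needles`, `make_retrieval`, `make_ordering`, `make_counting`, the word lists, the query wording) is hypothetical. It only shows how splicing unrelated needles into an arbitrary haystack lets the query and answer be produced by rule, independent of the haystack's content.

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    frames: list    # haystack frames with needle frames spliced in
    query: str      # rule-generated question (hypothetical wording)
    answer: object  # rule-generated ground truth


def insert_needles(haystack, needles, rng):
    """Return a copy of haystack with needles spliced in at random gaps.

    The relative order of the needles is preserved. Requires
    len(needles) <= len(haystack) + 1 so that each needle gets its own gap.
    """
    # Pick distinct gap indices in the original video, sorted so that
    # the i-th needle ends up before the (i+1)-th needle.
    positions = sorted(rng.sample(range(len(haystack) + 1), len(needles)))
    frames = list(haystack)
    # Insert from the back so the earlier gap indices remain valid.
    for pos, needle in reversed(list(zip(positions, needles))):
        frames.insert(pos, needle)
    return frames


def make_retrieval(haystack, rng, words=("apple", "river", "candle")):
    # Retrieval: one needle, the answer is its content.
    word = rng.choice(words)
    frames = insert_needles(haystack, [f"needle:{word}"], rng)
    return Sample(frames, "Which word appears in the inserted frame?", word)


def make_ordering(haystack, rng, words=("apple", "river", "candle", "mirror")):
    # Ordering: several distinct needles, the answer is their temporal order.
    order = rng.sample(words, 3)
    frames = insert_needles(haystack, [f"needle:{w}" for w in order], rng)
    return Sample(frames, "In what order do the inserted words appear?", order)


def make_counting(haystack, rng, max_count=5):
    # Counting: the same needle repeated, the answer is the number of copies.
    count = rng.randint(1, max_count)
    frames = insert_needles(haystack, ["needle:object"] * count, rng)
    return Sample(frames, "How many times does the inserted object appear?", count)


# Any haystack works, because the answers depend only on the needles.
rng = random.Random(0)
haystack = [f"frame{i}" for i in range(20)]
sample = make_ordering(haystack, rng)
```

Because the ground truth is fixed at generation time, the same rules can be reused with longer or different haystack videos without any new annotation.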
Based on the VideoNIAH framework, the authors construct a video benchmark, **VNBench**, which includes three tasks: retrieval, ordering, and counting. These evaluate three aspects of video understanding respectively: temporal perception, chronological ordering, and spatio-temporal coherence. Through a comprehensive evaluation of 12 video understanding models (3 proprietary and 9 open-source), the authors find significant differences among models across the tasks and offer suggestions for improving video MLLM training. In summary, the paper targets the inefficiency of existing video benchmarks and their difficulty in isolating skills by proposing the VideoNIAH framework and the VNBench benchmark, providing guidance for future video understanding research.
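To make the idea of skill-specific evaluation concrete, the sketch below aggregates results per task. It is an assumption-laden illustration, not the scoring protocol used by VNBench, which this page does not describe: the `per_task_accuracy` helper, the exact-match criterion, and the toy records are all hypothetical.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Compute exact-match accuracy per task.

    records: iterable of (task, prediction, answer) tuples.
    Returns a dict mapping each task name to its accuracy in [0, 1].
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, prediction, answer in records:
        total[task] += 1
        correct[task] += int(prediction == answer)
    return {task: correct[task] / total[task] for task in total}


# Toy, made-up records purely to exercise the helper (not real results).
records = [
    ("retrieval", "apple", "apple"),
    ("retrieval", "river", "candle"),
    ("ordering", ["a", "b", "c"], ["a", "b", "c"]),
    ("counting", 3, 4),
]
scores = per_task_accuracy(records)
```

Reporting one number per task, rather than a single pooled score, is what lets a weakness in, say, counting show up separately from retrieval performance.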