VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs

Ruotong Liao, Max Erler, Huiyu Wang, Guangyao Zhai, Gengyuan Zhang, Yunpu Ma, Volker Tresp
2024-10-05
Abstract: In the video-language domain, recent works in leveraging zero-shot Large Language Model-based reasoning for video understanding have become competitive challengers to previous end-to-end models. However, long video understanding presents unique challenges due to the complexity of reasoning over extended timespans, even for zero-shot LLM-based approaches. The challenge of information redundancy in long videos prompts the question of what specific information is essential for large language models (LLMs) and how to leverage them for complex spatial-temporal reasoning in long-form video analysis. We propose a framework VideoINSTA, i.e. INformative Spatial-TemporAl Reasoning for zero-shot long-form video understanding. VideoINSTA contributes (1) a zero-shot framework for long video understanding using LLMs; (2) an event-based temporal reasoning and content-based spatial reasoning approach for LLMs to reason over spatial-temporal information in videos; (3) a self-reflective information reasoning scheme balancing temporal factors based on information sufficiency and prediction confidence. Our model significantly improves the state-of-the-art on three long video question-answering benchmarks: EgoSchema, NextQA, and IntentQA, and the open question answering dataset ActivityNetQA. The code is released here: <a class="link-external link-https" href="https://github.com/mayhugotong/VideoINSTA" rel="external noopener nofollow">this https URL</a>.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses three key challenges in long-video understanding:

1. **Information Quality**: Videos contain a large amount of information, much of it redundant because consecutive frames differ only slightly. Identifying and extracting the most critical information is crucial for enhancing the ability of large language models (LLMs) to process videos. How can this extraction be achieved?
2. **Neglect of Spatial and Temporal Characteristics**: Videos are inherently temporal and spatial. How can this spatio-temporal information be retained and conveyed to support LLM reasoning? In particular, how do LLMs handle the temporal dynamics in videos?
3. **Imbalanced Information Inference Complexity over Time Span**: In long videos, the importance of information varies greatly along the time axis, and the implicit "intuition" of LLMs is not sufficient to handle all of it. How can an explicit inference algorithm be developed that handles this imbalance and takes the time factor into account?

To address these challenges, the authors propose VideoINSTA, an **INformative Spatial-TemporAl Reasoning** framework for zero-shot long-form video understanding. The framework is a composite system that extracts key information from long videos and applies spatio-temporal reasoning and time-aware self-reflective reasoning to it. Specifically, VideoINSTA has three components:

- **Event-based Temporal Reasoning**: An automatic temporal segmentation method, C-DPCKNN, divides a long video into multiple events. Local temporal information is carried over through the unified temporal representation tool UniVTG and a temporal localization scheme.
- **Content-based Spatial Reasoning**: Video captions are enriched by combining several visual-language captioning tools, with object detection and action captioning supplementing the spatial information.
- **Iterative Information Reasoning**: Based on the LLM's self-evaluation of information sufficiency and prediction confidence, the temporal and spatial information derived from the previous stages is iteratively merged.

Together, these methods improve long-video understanding on both multiple-choice and open-ended question-answering tasks.
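
The iterative merging idea can be illustrated with a minimal sketch. It is not the authors' implementation: the `Segment` type, the pairwise `merge_pairs` strategy, the 0.8 confidence threshold, and the keyword-based `toy_judge` (a stand-in for an LLM that reports an answer together with a self-assessed confidence) are all assumptions made for illustration. The loop queries each event segment, stops once the best confidence is high enough, and otherwise merges neighbouring segments so that the next round sees more context.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    info: str     # textual description of the event (captions, objects, actions)


def merge_pairs(segments: List[Segment]) -> List[Segment]:
    """Merge neighbouring segments two at a time, keeping temporal order.

    Hypothetical merge strategy chosen for illustration only.
    """
    merged = []
    for i in range(0, len(segments), 2):
        group = segments[i:i + 2]
        merged.append(Segment(group[0].start, group[-1].end,
                              " ".join(s.info for s in group)))
    return merged


def iterative_answer(
    segments: List[Segment],
    question: str,
    judge: Callable[[str, str], Tuple[str, float]],
    threshold: float = 0.8,  # assumed value, not taken from the paper
) -> Tuple[str, float, int]:
    """Return (answer, confidence, rounds used)."""
    if not segments:
        raise ValueError("at least one segment is required")
    rounds = 0
    while True:
        rounds += 1
        # Ask the judge about every segment and keep the most confident answer.
        candidates = [judge(s.info, question) for s in segments]
        answer, confidence = max(candidates, key=lambda c: c[1])
        # Stop when confident enough or when nothing is left to merge.
        if confidence >= threshold or len(segments) == 1:
            return answer, confidence, rounds
        segments = merge_pairs(segments)


def toy_judge(info: str, question: str) -> Tuple[str, float]:
    """Stand-in for an LLM call: confident only if both cues are visible."""
    if "knife" in info and "onion" in info:
        return "onion", 0.9
    return "unknown", 0.3


events = [
    Segment(0, 10, "person picks up knife"),
    Segment(10, 20, "onion on the board"),
    Segment(20, 30, "person washes hands"),
]
result = iterative_answer(events, "What does the person cut?", toy_judge)
print(result)
```

In this toy run no single event contains enough information, so the first round stays below the threshold; after one merge the first two events are read together and the judge answers with high confidence. A real system would replace `toy_judge` with an LLM prompt that returns both a prediction and a sufficiency or confidence score.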