Abstract: In the video-language domain, recent works in leveraging zero-shot Large Language Model-based reasoning for video understanding have become competitive challengers to previous end-to-end models. However, long video understanding presents unique challenges due to the complexity of reasoning over extended timespans, even for zero-shot LLM-based approaches. The challenge of information redundancy in long videos prompts the question of what specific information is essential for large language models (LLMs) and how to leverage them for complex spatial-temporal reasoning in long-form video analysis. We propose a framework VideoINSTA, i.e. INformative Spatial-TemporAl Reasoning for zero-shot long-form video understanding. VideoINSTA contributes (1) a zero-shot framework for long video understanding using LLMs; (2) an event-based temporal reasoning and content-based spatial reasoning approach for LLMs to reason over spatial-temporal information in videos; (3) a self-reflective information reasoning scheme balancing temporal factors based on information sufficiency and prediction confidence. Our model significantly improves the state-of-the-art on three long video question-answering benchmarks: EgoSchema, NextQA, and IntentQA, and the open question answering dataset ActivityNetQA. The code is released here: https://github.com/mayhugotong/VideoINSTA.

What problem does this paper attempt to address?

The paper addresses three key challenges in long-video understanding:

1. **Information Quality**: Long videos contain a large amount of information, much of it redundant because consecutive frames differ only slightly. Which information is most critical, and how can it be extracted so that large language models (LLMs) can process videos effectively?
2. **Neglect of Spatial and Temporal Characteristics**: Videos are inherently spatial and temporal. How can this spatial-temporal information be retained and conveyed to support LLM reasoning, and in particular, how should LLMs handle the temporal dynamics of a video?
3. **Imbalanced Information Inference Complexity over Time Span**: In long videos, the importance of information varies greatly along the time axis, and the implicit "intuition" of LLMs is not sufficient to handle all of it. How can an explicit reasoning algorithm be designed to handle this imbalance while taking temporal factors into account?

To address these challenges, the authors propose VideoINSTA, an **INformative Spatial-TemporAl Reasoning** framework for zero-shot long-form video understanding. The framework is a compound system that extracts key information from long videos and processes it through spatial-temporal reasoning and time-aware self-reflective reasoning. Specifically, VideoINSTA addresses the problems above with three components:

- **Event-based Temporal Reasoning**: An automatic temporal segmentation method, C-DPCKNN, divides a long video into multiple events. Local temporal information within each event is preserved through UniVTG, a unified temporal representation tool, together with a temporal localization scheme.
- **Content-based Spatial Reasoning**: Video captions are enriched by combining several vision-language captioning tools, with object detection and action captioning supplementing the spatial information.
- **Iterative Information Reasoning**: The temporal and spatial information produced by the previous stages is merged iteratively, guided by the LLM's self-evaluation of information sufficiency and prediction confidence.

Together, these methods yield significant improvements in long-video understanding on both multiple-choice and open-ended question answering tasks.
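The iterative information reasoning step can be pictured as a loop that keeps adding event descriptions to the LLM's context until the model judges the information sufficient and its answer confident enough. The following Python sketch illustrates that control flow only and is not the authors' implementation. The names `Event`, `Judgement`, `iterative_reasoning`, and `toy_judge`, the per-event `relevance` score, the most-relevant-first merge order, and the `confidence_threshold` value are all assumptions made for this example. The `judge` callable stands in for an LLM call that returns an answer together with a self-assessed sufficiency flag and confidence.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Event:
    start: float      # event start time in seconds
    end: float        # event end time in seconds
    caption: str      # textual description of the event
    relevance: float  # hypothetical question-relevance score in [0, 1]


@dataclass
class Judgement:
    answer: str
    sufficient: bool   # does the judge consider the context sufficient?
    confidence: float  # self-reported confidence in [0, 1]


def iterative_reasoning(
    events: List[Event],
    judge: Callable[[List[Event]], Judgement],
    confidence_threshold: float = 0.7,  # assumed value, not from the paper
) -> Tuple[Judgement, int]:
    """Merge events into the context one at a time, most relevant first.

    Stops early once the judge reports sufficient information and a
    confidence at or above the threshold. Returns the chosen judgement
    and the number of events that were in the context at that point.
    """
    if not events:
        raise ValueError("at least one event is required")

    ordered = sorted(events, key=lambda e: e.relevance, reverse=True)
    context: List[Event] = []
    best = None
    for event in ordered:
        context.append(event)
        # Present the context chronologically to keep temporal order intact.
        context.sort(key=lambda e: e.start)
        result = judge(context)
        if best is None or result.confidence > best.confidence:
            best = result
        if result.sufficient and result.confidence >= confidence_threshold:
            return result, len(context)
    # No judgement passed both checks: fall back to the most confident one.
    return best, len(context)


def toy_judge(context: List[Event]) -> Judgement:
    """Stand-in for an LLM call: confident only once a stove is mentioned."""
    text = " ".join(e.caption for e in context)
    if "stove" in text:
        return Judgement(answer="cooking", sufficient=True, confidence=0.9)
    return Judgement(answer="unknown", sufficient=False, confidence=0.2)


events = [
    Event(0, 30, "a person opens the fridge", relevance=0.8),
    Event(30, 90, "a person stirs a pot on the stove", relevance=0.6),
    Event(90, 120, "a person checks a phone", relevance=0.1),
]
result, used = iterative_reasoning(events, toy_judge)
print(result.answer, used)
```

In this toy run the loop stops after two of the three events, because the second merge already satisfies both the sufficiency and the confidence check. In a real system the judge would be a prompted LLM, and the stopping rule would need tuning for the benchmark at hand.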

VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs

A Simple LLM Framework for Long-Range Video Question-Answering

LLMs Meet Long Video: Advancing Long Video Question Answering with An Interactive Visual Adapter in LLMs

LongVLM: Efficient Long Video Understanding via Large Language Models

VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

LITA: Language Instructed Temporal-Localization Assistant

VideoQA in the Era of LLMs: An Empirical Study

Retrieval-based Video Language Model for Efficient Long Video Question Answering

LLMs Meet Long Video: Advancing Long Video Comprehension with an Interactive Visual Adapter in LLMs

Streaming Long Video Understanding with Large Language Models

Video Understanding with Large Language Models: A Survey

Discovering Spatio-Temporal Rationales for Video Question Answering

VideoLLM: Modeling Video Sequence with Large Language Models

Understanding Long Videos with Multimodal Language Models

ViLLa: Video Reasoning Segmentation with Large Language Model

Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos

Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering

VideoVista: A Versatile Benchmark for Video Understanding and Reasoning

STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training

Koala: Key frame-conditioned long video-LLM