Abstract: This paper presents VideoStreaming, an advanced vision-language large model (VLLM) for video understanding, that capably understands arbitrary-length video with a constant number of video tokens streamingly encoded and adaptively selected. The challenge of video understanding in the vision language area mainly lies in the significant computational burden caused by the great number of tokens extracted from long videos. Previous works rely on sparse sampling or frame compression to reduce tokens. However, such approaches either disregard temporal information in a long time span or sacrifice spatial details, resulting in flawed compression. To address these limitations, our VideoStreaming has two core designs: Memory-Propagated Streaming Encoding and Adaptive Memory Selection. The Memory-Propagated Streaming Encoding architecture segments long videos into short clips and sequentially encodes each clip with a propagated memory. In each iteration, we utilize the encoded results of the preceding clip as historical memory, which is integrated with the current clip to distill a condensed representation that encapsulates the video content up to the current timestamp. After the encoding process, the Adaptive Memory Selection strategy selects a constant number of question-related memories from all the historical memories and feeds them into the LLM to generate informative responses. The question-related selection reduces redundancy within the memories, enabling efficient and precise video understanding. Meanwhile, the disentangled video extraction and reasoning design allows the LLM to answer different questions about a video by directly selecting corresponding memories, without the need to encode the whole video for each question. Our model achieves superior performance and higher efficiency on long video benchmarks, showcasing precise temporal comprehension for detailed question answering.

What problem does this paper attempt to address?

The paper addresses the challenge of long video understanding in the vision-language domain. Existing methods face two main issues when dealing with long videos:

1. **Heavy computational burden**: Long videos contain a large number of frames, and the many tokens extracted from those frames make processing computationally expensive.
2. **Information loss**: To reduce the computational burden, existing methods often rely on sparse sampling or frame compression. These techniques either ignore temporal information over long durations or sacrifice spatial details, so the compressed representation is incomplete.

To address these issues, the paper proposes a model named **VideoStreaming**, which targets efficient and accurate long video understanding through two core designs:

1. **Memory-Propagated Streaming Encoding**:
   - The long video is divided into multiple short clips, and each clip is encoded sequentially.
   - In each iteration, the encoding result of the previous clip serves as historical memory and is combined with the features of the current clip to produce a condensed representation of the video content up to the current timestamp.
   - This captures long-term temporal dynamics while producing a fixed-length memory that can represent videos of any length.
2. **Adaptive Memory Selection**:
   - After encoding, a fixed number of memories relevant to the specific question are selected from all historical memories.
   - Selecting only question-relevant memories reduces redundancy, enabling efficient and precise video understanding.
   - Because video encoding and reasoning are decoupled, the model can answer different questions by selecting the corresponding memories, without re-encoding the entire video.

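The streaming-encoding loop described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `split_into_clips`, `condense`, and `streaming_encode` are hypothetical, frame features are plain lists of floats, and `condense` uses simple average pooling as a stand-in for the learned encoder that the paper uses to distill each clip together with its propagated memory.

```python
from typing import List

Vector = List[float]


def split_into_clips(frames: List[Vector], clip_len: int) -> List[List[Vector]]:
    """Segment a long sequence of frame features into short, consecutive clips."""
    return [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]


def condense(tokens: List[Vector], memory_size: int) -> List[Vector]:
    """Reduce any number of tokens to exactly `memory_size` vectors.

    ASSUMPTION: average pooling over contiguous chunks stands in for the
    learned condensation step; it is not the paper's encoder.
    """
    n = len(tokens)
    dim = len(tokens[0])
    memory = []
    for k in range(memory_size):
        start = k * n // memory_size
        end = max(start + 1, (k + 1) * n // memory_size)
        chunk = tokens[start:end]
        memory.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return memory


def streaming_encode(
    frames: List[Vector], clip_len: int, memory_size: int
) -> List[List[Vector]]:
    """Encode clips one by one, propagating the previous memory forward.

    Returns one fixed-size memory per clip; the memory at index t summarizes
    the video content up to and including clip t.
    """
    memory: List[Vector] = []          # no history before the first clip
    history: List[List[Vector]] = []
    for clip in split_into_clips(frames, clip_len):
        # Previous memory + current clip features -> new condensed memory.
        memory = condense(memory + clip, memory_size)
        history.append(memory)
    return history


if __name__ == "__main__":
    # Toy video: 8 frames with 1-dimensional features 0.0 .. 7.0.
    frames = [[float(i)] for i in range(8)]
    for t, mem in enumerate(streaming_encode(frames, clip_len=4, memory_size=2)):
        print(f"memory after clip {t}: {mem}")
```

The point of the sketch is the shape of the loop: each memory has the same size regardless of how many clips precede it, so the cost per step stays constant as the video grows.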
Through these designs, VideoStreaming demonstrates superior performance and higher inference efficiency on multiple long-video benchmarks, and it does particularly well at detailed question answering.
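The question-dependent selection step can be illustrated in the same spirit. The sketch below assumes each historical memory is a small list of feature vectors and that the question has already been embedded into the same space. The function names are hypothetical, and cosine similarity against a mean-pooled memory is an assumed scoring rule chosen for simplicity; the paper's actual selection mechanism may differ.

```python
import math
from typing import List

Vector = List[float]


def mean_vector(tokens: List[Vector]) -> Vector:
    """Mean-pool a memory's tokens into a single summary vector."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_memories(history: List[List[Vector]], question: Vector, k: int) -> List[int]:
    """Return the indices of the k memories most similar to the question.

    ASSUMPTION: similarity to a mean-pooled memory is a stand-in scoring rule.
    Indices are returned in temporal order so the downstream model sees the
    selected memories in the order they occurred in the video.
    """
    scores = [cosine(mean_vector(m), question) for m in history]
    ranked = sorted(range(len(history)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def gather_tokens(history: List[List[Vector]], indices: List[int]) -> List[Vector]:
    """Concatenate the selected memories into one fixed-length token sequence."""
    return [tok for i in indices for tok in history[i]]


if __name__ == "__main__":
    # Four toy memories, one 2-dimensional token each.
    history = [[[1.0, 0.0]], [[0.0, 1.0]], [[0.7, 0.7]], [[0.9, 0.1]]]
    # Two different questions reuse the same encoded history.
    print(select_memories(history, question=[1.0, 0.0], k=2))  # [0, 3]
    print(select_memories(history, question=[0.0, 1.0], k=2))  # [1, 2]
```

Note that `history` is computed once and then queried with different question vectors, which mirrors the decoupling of video encoding from question answering: only the cheap selection step is repeated per question, and the number of tokens passed on is always `k` times the memory size.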

Streaming Long Video Understanding with Large Language Models
