Abstract: This paper presents VideoStreaming, an advanced vision-language large model (VLLM) for video understanding that can process videos of arbitrary length using a constant number of video tokens, which are encoded in a streaming fashion and adaptively selected. The main challenge of video understanding in the vision-language area is the heavy computational burden caused by the large number of tokens extracted from long videos. Previous works rely on sparse sampling or frame compression to reduce the token count. However, such approaches either disregard temporal information over a long time span or sacrifice spatial details, resulting in flawed compression. To address these limitations, VideoStreaming has two core designs: Memory-Propagated Streaming Encoding and Adaptive Memory Selection. The Memory-Propagated Streaming Encoding architecture segments a long video into short clips and encodes each clip sequentially with a propagated memory. In each iteration, the encoded result of the preceding clip serves as historical memory, which is integrated with the current clip to distill a condensed representation that encapsulates the video content up to the current timestamp. After encoding, the Adaptive Memory Selection strategy selects a constant number of question-related memories from all the historical memories and feeds them into the LLM to generate informative responses. The question-related selection reduces redundancy within the memories, enabling efficient and precise video understanding. Moreover, because video feature extraction is disentangled from reasoning, the LLM can answer different questions about a video by directly selecting the corresponding memories, without re-encoding the whole video for each question. Our model achieves superior performance and higher efficiency on long video benchmarks, showcasing precise temporal comprehension for detailed question answering.
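The two-stage flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the learned encoder is replaced by simple average pooling, question relevance is approximated by cosine similarity, and every name here (`split_into_clips`, `condense`, `streaming_encode`, `select_memories`, `clip_len`, `memory_size`, `k`) is a hypothetical choice made for illustration only.

```python
import math


def split_into_clips(frames, clip_len):
    """Segment a long sequence of frame feature vectors into short clips."""
    return [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]


def condense(vectors, memory_size):
    """Toy stand-in for the learned encoder (assumption): average-pool the
    input vectors into `memory_size` slots of equal dimension."""
    n, dim = len(vectors), len(vectors[0])
    slots = []
    for s in range(memory_size):
        lo = s * n // memory_size
        hi = max((s + 1) * n // memory_size, lo + 1)  # never an empty chunk
        chunk = vectors[lo:hi]
        slots.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return slots


def streaming_encode(frames, clip_len, memory_size):
    """Encode clips one by one; the memory of the preceding clip is
    concatenated with the current clip and condensed again."""
    memories, previous = [], []
    for clip in split_into_clips(frames, clip_len):
        previous = condense(previous + clip, memory_size)
        memories.append(previous)  # keep every historical memory
    return memories


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_memories(memories, question_vec, k):
    """Return the indices of the k memories most similar to the question,
    restored to temporal order."""
    def score(mem):
        summary = [sum(col) / len(mem) for col in zip(*mem)]
        return cosine(summary, question_vec)

    ranked = sorted(range(len(memories)),
                    key=lambda i: score(memories[i]), reverse=True)
    return sorted(ranked[:k])
```

In this sketch the number of vectors passed downstream is always `k * memory_size`, regardless of video length, and `streaming_encode` runs once per video while `select_memories` can be called again for each new question, which mirrors the disentangled extraction and reasoning design the abstract describes.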

LSTP: Language-guided Spatial-Temporal Prompt Learning for Long-form Video-Text Understanding

Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs

ST-LLM: Large Language Models Are Effective Temporal Learners

Streaming Long Video Understanding with Large Language Models

PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance

LongVLM: Efficient Long Video Understanding via Large Language Models

Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos

STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding

Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning

LSPT: Long-term Spatial Prompt Tuning for Visual Representation Learning

Towards Visual-Prompt Temporal Answer Grounding in Instructional Video

LLMs Meet Long Video: Advancing Long Video Question Answering with An Interactive Visual Adapter in LLMs

VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs

Retrieval-based Video Language Model for Efficient Long Video Question Answering

Empowering Large Language Model for Continual Video Question Answering with Collaborative Prompting

Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts

A Simple LLM Framework for Long-Range Video Question-Answering

Exploiting long-term temporal dynamics for video captioning

Efficient Transfer Learning for Video-language Foundation Models