Abstract: Empowered by Large Language Models (LLMs), recent advancements in Video-based LLMs (VideoLLMs) have driven progress in various video understanding tasks. These models encode video representations through pooling or query aggregation over a vast number of visual tokens, making computational and memory costs affordable. Despite successfully providing an overall comprehension of video content, existing VideoLLMs still face challenges in achieving detailed understanding due to overlooking local information in long-term videos. To tackle this challenge, we introduce LongVLM, a simple yet powerful VideoLLM for long video understanding, building upon the observation that long videos often consist of sequential key events, complex actions, and camera movements. Our approach proposes to decompose long videos into multiple short-term segments and encode local features for each segment via a hierarchical token merging module. These features are concatenated in temporal order to maintain the storyline across sequential short-term segments. Additionally, we propose to integrate global semantics into each local feature to enhance context understanding. In this way, we encode video representations that incorporate both local and global information, enabling the LLM to generate comprehensive responses for long-term videos. Experimental results on the VideoChatGPT benchmark and zero-shot video question-answering datasets demonstrate the superior capabilities of our model over the previous state-of-the-art methods. Qualitative examples show that our model produces more precise responses for long video understanding. Code is available at https://github.com/ziplab/LongVLM.
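The pipeline described in the abstract (split into short-term segments, merge tokens within each segment, concatenate in temporal order, inject global semantics) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the adjacent-pair averaging used as a stand-in for the hierarchical token merging module, and the choice to add a mean global vector to each local token are all assumptions made for illustration.

```python
from typing import List

Token = List[float]  # one visual token as a plain feature vector


def mean_tokens(tokens: List[Token]) -> Token:
    """Element-wise mean of equally sized token vectors."""
    n = len(tokens)
    return [sum(vals) / n for vals in zip(*tokens)]


def merge_once(tokens: List[Token]) -> List[Token]:
    """Roughly halve the token count by averaging adjacent pairs.

    Assumption: a simple stand-in for a similarity-based merging step.
    An odd leftover token is kept unchanged.
    """
    merged = [mean_tokens(tokens[i:i + 2]) for i in range(0, len(tokens) - 1, 2)]
    if len(tokens) % 2:
        merged.append(tokens[-1])
    return merged


def hierarchical_merge(tokens: List[Token], target: int) -> List[Token]:
    """Apply merge_once repeatedly until at most `target` tokens remain."""
    if target < 1:
        raise ValueError("target must be at least 1")
    while len(tokens) > target:
        tokens = merge_once(tokens)
    return tokens


def encode_video(frames: List[List[Token]], segment_len: int,
                 tokens_per_segment: int) -> List[Token]:
    """Hypothetical encoder: local tokens per segment plus a global vector."""
    all_tokens = [tok for frame in frames for tok in frame]
    # Assumption: global semantics approximated by the mean over all tokens.
    global_vec = mean_tokens(all_tokens)

    video_tokens: List[Token] = []
    # Walk the segments in temporal order so the storyline is preserved.
    for start in range(0, len(frames), segment_len):
        segment = frames[start:start + segment_len]
        segment_tokens = [tok for frame in segment for tok in frame]
        local = hierarchical_merge(segment_tokens, tokens_per_segment)
        # Assumption: integrate global context by adding it to each local token.
        video_tokens.extend(
            [a + b for a, b in zip(tok, global_vec)] for tok in local
        )
    return video_tokens


# Toy input: 4 frames, 2 tokens per frame, 2-dimensional features.
frames = [[[float(i), 0.0], [float(i), 0.0]] for i in range(4)]
encoded = encode_video(frames, segment_len=2, tokens_per_segment=2)
```

With this toy input, each two-frame segment contributes two merged tokens, so the LLM would receive four tokens instead of eight while the segment order is retained.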

LLM×MapReduce: Simplified Long-Sequence Processing Using Large Language Models

XL3M: A Training-free Framework for LLM Length Extension Based on Segment-wise Inference

InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory

E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning

SEGMENT+: Long Text Processing with Short-Context Language Models

LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism

Extending Context Window of Large Language Models via Semantic Compression

FltLM: An Intergrated Long-Context Large Language Model for Effective Context Filtering and Understanding

Efficient Solutions For An Intriguing Failure of LLMs: Long Context Window Does Not Mean LLMs Can Analyze Long Sequences Flawlessly

LongVLM: Efficient Long Video Understanding via Large Language Models

Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA

Efficiently Exploring Large Language Models for Document-Level Machine Translation with In-context Learning

Language Models can Self-Lengthen to Generate Long Texts

A Controlled Study on Long Context Extension and Generalization in LLMs

Training-Free Long-Context Scaling of Large Language Models

UniMem: Towards a Unified View of Long-Context Large Language Models

LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models

From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding

LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models

MM-LLMs: Recent Advances in MultiModal Large Language Models