Abstract: Empowered by Large Language Models (LLMs), recent Video-based LLMs (VideoLLMs) have driven progress in various video understanding tasks. These models encode video representations through pooling or query aggregation over a vast number of visual tokens, which keeps computational and memory costs affordable. Although they provide an overall comprehension of video content, existing VideoLLMs still struggle to achieve detailed understanding because they overlook local information in long-term videos. To tackle this challenge, we introduce LongVLM, a simple yet powerful VideoLLM for long video understanding, building on the observation that long videos often consist of sequential key events, complex actions, and camera movements. Our approach decomposes long videos into multiple short-term segments and encodes local features for each segment via a hierarchical token merging module. These features are concatenated in temporal order to maintain the storyline across sequential short-term segments. Additionally, we integrate global semantics into each local feature to enhance context understanding. In this way, we encode video representations that incorporate both local and global information, enabling the LLM to generate comprehensive responses for long-term videos. Experimental results on the VideoChatGPT benchmark and zero-shot video question-answering datasets demonstrate the superior capabilities of our model over previous state-of-the-art methods. Qualitative examples show that our model produces more precise responses for long video understanding. Code is available at https://github.com/ziplab/LongVLM.
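
The pipeline described in the abstract (split into short-term segments, merge tokens within each segment, concatenate in temporal order, then inject global semantics) can be illustrated with a minimal NumPy sketch. This is not the released LongVLM implementation: the function names `merge_tokens` and `encode_video`, the input layout `(frames, tokens, dim)`, and all sizes are hypothetical. The merging step is a simple stand-in that averages adjacent token pairs rather than the paper's hierarchical token merging module, and adding a mean-pooled global vector to every local token is only one assumed way to combine global and local information.

```python
import numpy as np


def merge_tokens(tokens: np.ndarray, target: int) -> np.ndarray:
    """Reduce (n, d) tokens to at most `target` tokens.

    Stand-in for hierarchical token merging (an assumption): each pass
    averages adjacent token pairs, roughly halving the token count.
    """
    if target < 1:
        raise ValueError("target must be at least 1")
    while tokens.shape[0] > target:
        n = tokens.shape[0]
        even = n - (n % 2)
        # Average tokens (0,1), (2,3), ...; carry over an odd leftover token.
        paired = tokens[:even].reshape(even // 2, 2, -1).mean(axis=1)
        tokens = np.concatenate([paired, tokens[even:]], axis=0)
    return tokens


def encode_video(frames: np.ndarray, num_segments: int,
                 tokens_per_segment: int) -> np.ndarray:
    """Build a video representation from per-frame tokens.

    frames: array of shape (T, N, D) = (frames, tokens per frame, dim).
    """
    _, _, dim = frames.shape
    # 1) Decompose the video into short-term segments along the time axis.
    segments = np.array_split(frames, num_segments, axis=0)
    # 2) Encode local features per segment by merging its tokens.
    local = [merge_tokens(seg.reshape(-1, dim), tokens_per_segment)
             for seg in segments]
    # 3) Concatenate in temporal order to preserve the storyline.
    local = np.concatenate(local, axis=0)
    # 4) Inject global semantics (assumed here: mean over all tokens,
    #    broadcast-added to every local token).
    global_vec = frames.reshape(-1, dim).mean(axis=0)
    return local + global_vec


# Example with made-up sizes: 16 frames, 8 tokens per frame, dim 4.
rng = np.random.default_rng(0)
video = rng.standard_normal((16, 8, 4))
features = encode_video(video, num_segments=4, tokens_per_segment=8)
print(features.shape)
```

With these sizes, each of the 4 segments holds 32 tokens and is reduced to 8, so the LLM would receive 32 tokens instead of the original 128 while the segment order is preserved.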

Unifying Specialized Visual Encoders for Video Language Models

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

VideoLLM: Modeling Video Sequence with Large Language Models

Audio-Visual LLM for Video Understanding

LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling

EVLM: An Efficient Vision-Language Model for Visual Understanding

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

Dense Connector for MLLMs

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding

MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

LLaViLo: Boosting Video Moment Retrieval Via Adapter-Based Multimodal Modeling

MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model

MoVA: Adapting Mixture of Vision Experts to Multimodal Context

Understanding Long Videos with Multimodal Language Models

HyViLM: Enhancing Fine-Grained Recognition with a Hybrid Encoder for Vision-Language Models

LongVLM: Efficient Long Video Understanding via Large Language Models

From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding

TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations