Abstract: Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge, constrained by the LLM's context size. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism that reduces the number of video tokens while preserving the visual details of long videos. Our idea is based on leveraging cross-modal query and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we leverage DINOv2 features to remove redundant frames that exhibit high similarity. Then we utilize a text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames with little visual information loss within a given context length. LongVU consistently surpasses existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a light-weight LLM, LongVU also scales effectively to a smaller size with state-of-the-art video understanding performance.
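
The first step named in the abstract, removing frames whose features are highly similar, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy vectors stand in for per-frame DINOv2 features, the comparison against the most recently kept frame is one simple choice among several, and the function names and the `threshold` value of 0.9 are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prune_redundant_frames(frame_features, threshold=0.9):
    """Return the indices of the frames to keep.

    A frame is dropped when its similarity to the most recently kept
    frame is at least `threshold`. The threshold is a hypothetical
    value chosen for this example, not one taken from the paper.
    """
    kept = []
    for index, feature in enumerate(frame_features):
        if not kept or cosine_similarity(frame_features[kept[-1]], feature) < threshold:
            kept.append(index)
    return kept


# Toy vectors standing in for per-frame DINOv2 features (assumption).
features = [
    [1.0, 0.0, 0.0],    # scene A
    [0.99, 0.05, 0.0],  # near-duplicate of scene A
    [0.0, 1.0, 0.0],    # scene B
    [0.0, 0.98, 0.1],   # near-duplicate of scene B
    [0.0, 0.0, 1.0],    # scene C
]
kept_indices = prune_redundant_frames(features)
print(kept_indices)  # [0, 2, 4]
```

In this toy input, the two near-duplicate frames are dropped and one frame per scene survives, so static stretches of a video cost far fewer frames than dynamic ones.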

What problem does this paper attempt to address?

The problem this paper addresses is: **how to process and understand long videos effectively despite the context-length limits of existing large language models (LLMs)**. Although existing multimodal large language models (MLLMs) have made remarkable progress in understanding and analyzing video content, they hit bottlenecks on long videos (such as videos exceeding 1 hour). The main reasons are:

1. **Context length limitation**: Advanced MLLMs require hundreds of tokens to represent a single image. For example, LLaVA-1.6 uses 576 to 2,880 tokens per image, and LLaVA-OneVision uses 7,290 tokens. However, the context length commonly used in multimodal training is 8k tokens, which can hold only about 125 frames (about 2 minutes of video), a figure that implies roughly 64 tokens per frame. An hour-long video may require more than 200k tokens.
2. **Computational resource limitation**: Processing long videos requires a large amount of GPU memory, making training infeasible.
3. **Limitations of existing methods**: Most existing methods uniformly sample a fixed number of video frames as input, ignoring the non-uniform content of the video (such as static versus dynamic scenes), so key frames are missed or information is lost.

To solve these problems, the paper proposes **LongVU**, a spatiotemporal adaptive compression mechanism that aims to reduce the number of video tokens while retaining visual details. The specific objectives of LongVU include:

- **Reducing redundant frames**: Removing highly similar, redundant frames by using DINOv2 features.
- **Selective frame feature compression**: Selectively reducing frame features based on text-guided cross-modal queries.
- **Time-dependent spatial token compression**: Further compressing spatial tokens across frames, based on their temporal dependencies, to fit the given context length.
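
The selective frame feature compression objective above can be sketched as follows. This is a rough illustration under stated assumptions, not LongVU's actual mechanism: relevance is approximated by the largest dot product between a query vector and any token of a frame, the reduction is plain average pooling, and `select_and_reduce`, `num_full`, and `group_size` are hypothetical names.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def average_pool(tokens, group_size):
    """Average each run of `group_size` consecutive token vectors into one."""
    pooled = []
    for start in range(0, len(tokens), group_size):
        group = tokens[start:start + group_size]
        dim = len(group[0])
        pooled.append([sum(tok[d] for tok in group) / len(group) for d in range(dim)])
    return pooled


def select_and_reduce(frames, query, num_full, group_size):
    """Keep every token of the `num_full` frames most relevant to `query`
    and average-pool the tokens of all other frames.

    Relevance is a stand-in for a cross-modal query (assumption): the
    largest dot product between `query` and any token of the frame.
    """
    scores = [max(dot(query, tok) for tok in frame) for frame in frames]
    ranked = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
    full = set(ranked[:num_full])
    return [
        frame if i in full else average_pool(frame, group_size)
        for i, frame in enumerate(frames)
    ]


# Three toy frames with four 2-dimensional tokens each (made-up values).
frames = [
    [[0.0, 1.0], [0.0, 1.0], [0.25, 0.75], [0.25, 0.75]],
    [[1.0, 0.0], [0.75, 0.25], [0.5, 0.5], [1.0, 0.0]],
    [[0.25, 0.75], [0.0, 1.0], [0.0, 1.0], [0.25, 0.75]],
]
query = [1.0, 0.0]  # hypothetical text-query embedding

reduced = select_and_reduce(frames, query, num_full=1, group_size=4)
token_counts = [len(frame) for frame in reduced]
print(token_counts)  # [1, 4, 1]
```

Only the frame most aligned with the query keeps all four tokens; the other two collapse to a single pooled token each, which is the trade the objective describes: detail where the question needs it, compression elsewhere.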
Through these methods, LongVU can efficiently process long videos without exceeding the context length of common LLMs, and it achieves significantly better results than existing methods on multiple video understanding benchmarks.
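
The context-length figures quoted in the answer can be checked with a short back-of-the-envelope calculation. The sketch below makes three assumptions: "8k" is read as 8,000 tokens, the 125-frame figure corresponds to 64 tokens per frame, and video is sampled at 1 frame per second. The helper names are hypothetical.

```python
def max_frames(context_tokens, tokens_per_frame, reserved_text_tokens=0):
    """Whole frames that fit in the context after reserving room for text."""
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    return max(context_tokens - reserved_text_tokens, 0) // tokens_per_frame


def tokens_needed(duration_seconds, frames_per_second, tokens_per_frame):
    """Visual tokens required to encode a video without any compression."""
    return int(duration_seconds * frames_per_second) * tokens_per_frame


CONTEXT = 8000  # "8k" read as 8,000 tokens (assumption; it may mean 8,192)

print(max_frames(CONTEXT, 64))     # 125 frames at 64 tokens per frame
print(max_frames(CONTEXT, 576))    # 13 frames at LLaVA-1.6's lower bound
print(max_frames(CONTEXT, 7290))   # 1 frame at LLaVA-OneVision's token count
print(tokens_needed(3600, 1, 64))  # 230400 tokens for one hour at 1 fps
```

Under these assumptions the numbers agree with the text: 125 frames at 1 frame per second is about 2 minutes of video, and an hour of video needs more than 200k tokens even at 64 tokens per frame.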

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
