LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, Vikas Chandra
2024-10-23
Abstract: Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge constrained by the LLM's context size. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos. Our idea is based on leveraging cross-modal query and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we leverage DINOv2 features to remove redundant frames that exhibit high similarity. Then we utilize a text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames with little visual information loss within a given context length. Our LongVU consistently surpasses existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a light-weight LLM, our LongVU also scales effectively into a smaller size with state-of-the-art video understanding performance.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper attempts to solve is: **how to process and understand long videos effectively despite the context-length limits of existing large language models (LLMs)**. Although existing multimodal large language models (MLLMs) have made remarkable progress in understanding and analyzing video content, they hit bottlenecks on long videos (such as videos exceeding 1 hour). The main reasons are:
1. **Context length limitation**: Advanced MLLMs require hundreds of tokens to represent a single image. For example, LLaVA-1.6 uses 576 to 2,880 tokens per image, and LLaVA-OneVision uses 7,290 tokens. However, the context length commonly used in multimodal training is 8k tokens, which can hold only approximately 125 frames (about 2 minutes of video, which implies a budget of roughly 64 tokens per frame at about one frame per second), while an hour-long video may require more than 200k tokens.
2. **Computational resource limitation**: Processing long videos requires a large amount of GPU memory, making training infeasible.
3. **Limitations of existing methods**: Most existing methods uniformly sample a fixed number of video frames as input. This ignores the non-uniform content of a video (static versus dynamic scenes), so key frames can be missed and information lost.
To solve these problems, the paper proposes **LongVU**, a spatiotemporal adaptive compression mechanism that aims to reduce the number of video tokens while retaining visual details. The specific objectives of LongVU include:
- **Reducing redundant frames**: Removing highly similar, redundant frames by comparing their DINOv2 features.
- **Selective frame feature compression**: Selectively reducing frame features based on a text-guided cross-modal query.
- **Time-dependent spatial token compression**: Further compressing spatial tokens across frames, based on their temporal dependencies, to fit the given context length.
Through these methods, LongVU can process long videos efficiently without exceeding the context length of common LLMs, and it achieves significantly better results than existing methods on multiple video understanding benchmarks.
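To make the context-length arithmetic and the first compression step more concrete, here is a minimal Python sketch. It is not the paper's implementation. The function names `max_frames_in_context` and `prune_redundant_frames`, the greedy "compare against the last kept frame" rule, and the similarity threshold of 0.9 are all assumptions made for illustration. So is the 64-tokens-per-frame budget, which is only inferred from the 8k-token and 125-frame figures quoted above. Plain NumPy vectors stand in for per-frame DINOv2 features.

```python
from __future__ import annotations

import numpy as np


def max_frames_in_context(context_tokens: int, tokens_per_frame: int) -> int:
    """Number of whole frames that fit in a fixed token budget."""
    return context_tokens // tokens_per_frame


def prune_redundant_frames(features: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Return the indices of frames to keep (hypothetical helper).

    features:  (num_frames, dim) array with one feature vector per frame.
               In LongVU these would come from DINOv2; here any vectors work.
    threshold: assumed cutoff; a frame is dropped when its cosine similarity
               to the most recently kept frame is above this value.
    """
    if len(features) == 0:
        return []

    # Normalize each row so a dot product equals cosine similarity.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)

    kept = [0]  # the first frame is always kept
    for i in range(1, len(unit)):
        similarity = float(unit[i] @ unit[kept[-1]])
        if similarity <= threshold:
            kept.append(i)
    return kept


if __name__ == "__main__":
    # Budget check: 8,000 tokens at an assumed 64 tokens per frame.
    print(max_frames_in_context(8000, 64))  # 125

    # Toy features: frames 1 and 3 nearly duplicate the frame before them.
    toy = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
    print(prune_redundant_frames(toy))  # [0, 2, 4]
```

Under the same assumed budget, one hour of video sampled at one frame per second would need 3,600 × 64 = 230,400 tokens, which is consistent with the "more than 200k tokens" figure above. That gap is why frames must be pruned before the text-guided and spatial reduction steps are applied.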