VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models

Xiaohan Lan, Yitian Yuan, Zequn Jie, Lin Ma
2024-10-15
Abstract: Video-based multimodal large language models (Video-LLMs) possess significant potential for video understanding tasks. However, most Video-LLMs treat videos as a sequential set of individual frames, which results in insufficient temporal-spatial interaction that hinders fine-grained comprehension and difficulty in processing longer videos due to limited visual token capacity. To address these challenges, we propose VidCompress, a novel Video-LLM featuring memory-enhanced temporal compression. VidCompress employs a dual-compressor approach: a memory-enhanced compressor captures both short-term and long-term temporal relationships in videos and compresses the visual tokens using a multiscale transformer with a memory-cache mechanism, while a text-perceived compressor generates condensed visual tokens by utilizing Q-Former and integrating temporal contexts into query embeddings with cross attention. Experiments on several VideoQA datasets and comprehensive benchmarks demonstrate that VidCompress efficiently models complex temporal-spatial relations and significantly outperforms existing Video-LLMs.
Computer Vision and Pattern Recognition, Multimedia
What problem does this paper attempt to address?
This paper addresses two challenges of video understanding with large language models (LLMs). Most existing video-based multimodal LLMs (Video-LLMs) treat a video as a sequence of independent frames. This leaves too little spatio-temporal interaction for fine-grained understanding, and the limited visual token capacity makes longer videos hard to process.

To address these issues, the authors propose VidCompress, a novel Video-LLM with memory-enhanced temporal compression. VidCompress uses a dual-compressor approach:

1. **Memory-enhanced Compressor**: captures short-term and long-term temporal relationships in videos and compresses the visual tokens using a multiscale transformer with a memory-cache mechanism.
2. **Text-perceived Compressor**: generates condensed visual tokens by leveraging Q-Former and integrating temporal context into the query embeddings with cross attention.

In this way, VidCompress models both short-term correlations and long-term associations over time while remaining efficient. Experimental results show that VidCompress performs well on multiple VideoQA datasets and comprehensive benchmarks, especially when handling long videos.
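
The general idea behind a memory-cache style of temporal compression can be illustrated with a toy sketch. This is not the authors' implementation: average pooling stands in for the multiscale transformer, a fixed 50/50 blend stands in for attention over the cache, and the names `mean_pool`, `compress_video`, `tokens_per_clip`, and `memory_size` are hypothetical. The sketch only shows how each clip can be compressed locally (short-term) while a bounded cache of earlier clips supplies long-term context.

```python
from collections import deque


def mean_pool(tokens, out_len):
    """Average-pool a list of token vectors down to at most out_len vectors."""
    n = len(tokens)
    if n == 0 or out_len <= 0:
        raise ValueError("need at least one token and a positive out_len")
    out_len = min(out_len, n)
    pooled = []
    for i in range(out_len):
        # Split the n tokens into out_len contiguous, non-empty groups.
        group = tokens[i * n // out_len:(i + 1) * n // out_len]
        dim = len(group[0])
        pooled.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return pooled


def compress_video(clips, tokens_per_clip=2, memory_size=3):
    """Toy memory-cache compression (illustrative assumption, not VidCompress).

    clips: list of clips, each a list of token vectors (lists of floats).
    Returns the concatenated compressed tokens of all clips.
    """
    memory = deque(maxlen=memory_size)  # bounded cache of earlier pooled clips
    outputs = []
    for clip_tokens in clips:
        # Short-term: compress the tokens inside the current clip.
        local = mean_pool(clip_tokens, tokens_per_clip)
        blended = local
        if memory:
            # Long-term: summarize every cached token into one context vector
            # and mix it into the current clip's tokens.
            cached = [vec for entry in memory for vec in entry]
            context = mean_pool(cached, 1)[0]
            blended = [[0.5 * a + 0.5 * b for a, b in zip(vec, context)]
                       for vec in local]
        memory.append(local)  # cache the un-blended pooled tokens
        outputs.extend(blended)
    return outputs
```

For example, two clips of two 2-dimensional tokens each, compressed with `tokens_per_clip=1`, yield two output tokens instead of four, and the second one is pulled toward the cached summary of the first clip. Because the cache is a `deque` with `maxlen`, the oldest clips are dropped once `memory_size` is reached, which keeps the cost per clip bounded regardless of video length.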