VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models

Xiaohan Lan, Yitian Yuan, Zequn Jie, Lin Ma

2024-10-15

Abstract: Video-based multimodal large language models (Video-LLMs) possess significant potential for video understanding tasks. However, most Video-LLMs treat videos as a sequential set of individual frames, which results in insufficient temporal-spatial interaction that hinders fine-grained comprehension and difficulty in processing longer videos due to limited visual token capacity. To address these challenges, we propose VidCompress, a novel Video-LLM featuring memory-enhanced temporal compression. VidCompress employs a dual-compressor approach: a memory-enhanced compressor captures both short-term and long-term temporal relationships in videos and compresses the visual tokens using a multiscale transformer with a memory-cache mechanism, while a text-perceived compressor generates condensed visual tokens by utilizing Q-Former and integrating temporal contexts into query embeddings with cross attention. Experiments on several VideoQA datasets and comprehensive benchmarks demonstrate that VidCompress efficiently models complex temporal-spatial relations and significantly outperforms existing Video-LLMs.

Computer Vision and Pattern Recognition, Multimedia

What problem does this paper attempt to address?

This paper addresses the challenges that large language models (LLMs) face in video understanding. Most existing video-language models (Video-LLMs) treat a video as a series of independent frames. This leaves too little spatio-temporal interaction for fine-grained understanding, and it makes longer videos hard to handle because the capacity for visual tokens is limited. To address these issues, the authors propose VidCompress, a novel Video-LLM with memory-enhanced temporal compression. VidCompress takes a dual-compressor approach:

1. **Memory-enhanced Compressor**: captures both short-term and long-term temporal relationships in a video and compresses the visual tokens using a multiscale transformer with a memory-cache mechanism.
2. **Text-perceived Compressor**: generates condensed visual tokens by using Q-Former and integrating temporal context into the query embeddings with cross attention.

With this design, VidCompress models short-term correlations and long-term associations over time while remaining efficient, which significantly improves performance on video understanding tasks. Experimental results show that VidCompress performs well on multiple video question answering datasets and comprehensive benchmarks, especially when handling long videos.
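The general idea of compressing each clip into a fixed number of tokens while consulting a bounded cache of earlier results can be sketched as follows. This is a minimal illustration under loud assumptions, not the authors' implementation: it swaps the multiscale transformer and Q-Former for a single-head cross attention without learned projections, uses a simple first-in-first-out cache, and every name (`cross_attend`, `compress_video`, `memory_size`) is hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, tokens):
    """Single-head cross attention with no learned projections (illustrative only)."""
    d = queries.shape[-1]
    scores = queries @ tokens.T / np.sqrt(d)   # (num_queries, num_tokens)
    return softmax(scores, axis=-1) @ tokens   # (num_queries, d)


def compress_video(clips, queries, memory_size):
    """Compress every clip to len(queries) tokens while reading a bounded memory.

    clips:       list of arrays, each of shape (tokens_per_clip, d)
    queries:     array of shape (num_queries, d); stands in for learned queries
    memory_size: maximum number of past compressed tokens kept (hypothetical knob)
    """
    if memory_size < 1:
        raise ValueError("memory_size must be at least 1")
    d = queries.shape[-1]
    memory = np.zeros((0, d))                  # empty cache before the first clip
    outputs = []
    for clip in clips:
        # Long-term context (cached tokens) plus short-term context (current clip).
        context = np.concatenate([memory, clip], axis=0)
        compressed = cross_attend(queries, context)
        outputs.append(compressed)
        # Assumed FIFO policy: keep only the most recent compressed tokens.
        memory = np.concatenate([memory, compressed], axis=0)[-memory_size:]
    return np.concatenate(outputs, axis=0), memory


rng = np.random.default_rng(0)
clips = [rng.normal(size=(16, 8)) for _ in range(5)]   # 5 clips, 16 tokens each
queries = rng.normal(size=(4, 8))                      # 4 output tokens per clip
out, mem = compress_video(clips, queries, memory_size=8)
print(out.shape, mem.shape)
```

In this sketch the number of tokens passed downstream grows with the number of clips times the number of queries, not with the raw token count, and the per-step attention cost is bounded by `memory_size` plus the clip length. A text-conditioned variant would derive `queries` from the question; that step is left out here.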

VoCo-LLaMA: Towards Vision Compression with Large Language Models

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Efficient Large Multi-modal Models via Visual Context Compression

VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM

VcLLM: Video Codecs are Secretly Tensor Codecs

VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges

VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation

Streaming Long Video Understanding with Large Language Models

DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models

Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos

Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding

When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding

High Efficiency Image Compression for Large Visual-Language Models

TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models

LongVLM: Efficient Long Video Understanding via Large Language Models

Enhancing Long Video Understanding via Hierarchical Event-Based Memory

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding