Abstract: Recent advancements in large-scale video-language models have shown significant potential for real-time planning and detailed interactions. However, their high computational demands and the scarcity of annotated datasets limit their practicality for academic researchers. In this work, we introduce VideoLLaMB, a novel framework that uses temporal memory tokens within bridge layers to encode entire video sequences alongside historical visual data, preserving semantic continuity and improving model performance across various tasks. The approach combines recurrent memory tokens with a SceneTilling algorithm, which segments videos into independent semantic units to preserve semantic integrity. Empirically, VideoLLaMB significantly outperforms existing video-language models, with a 5.5-point improvement over its competitors across three VideoQA benchmarks and a 2.06-point improvement on egocentric planning. Comprehensive results on MVBench show that VideoLLaMB-7B achieves markedly better results than previous 7B models built on the same LLM. Remarkably, it maintains performance as robust as PLLaVA's even as video length increases up to 8 times. In addition, frame retrieval results on our specialized Needle in a Video Haystack (NIAVH) benchmark further validate VideoLLaMB's ability to accurately identify specific frames within lengthy videos. Our SceneTilling algorithm also enables the generation of streaming video captions directly, without additional training. In terms of efficiency, VideoLLaMB, trained on 16 frames, supports up to 320 frames on a single Nvidia A100 GPU with linear GPU memory scaling, ensuring both high performance and cost-effectiveness and setting a new foundation for long-form video-language models in both academic and practical applications.
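The abstract does not spell out how SceneTilling finds its segment boundaries, so the following is only an illustrative sketch of one plausible approach, assuming a TextTiling-style procedure: compare adjacent frame features by cosine similarity, score each gap by how deep the similarity dips, and cut where the dip exceeds a statistical threshold. The function names, the `alpha` parameter, and the use of the global left/right maximum as the "peak" are all assumptions made for illustration, not details taken from the paper.

```python
import math
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def segment_frames(features, alpha=0.5):
    """Split frame indices into contiguous segments at similarity dips.

    features: list of per-frame feature vectors (hypothetical input, e.g.
              one embedding per sampled frame).
    alpha:    assumed sensitivity knob; larger values yield fewer cuts.
    """
    n = len(features)
    if n == 0:
        return []
    if n == 1:
        return [[0]]

    # Similarity between each pair of adjacent frames.
    sims = [cosine(features[i], features[i + 1]) for i in range(n - 1)]

    # Depth score of each gap: how far the similarity drops below the
    # highest similarity on its left and on its right (a simplification
    # of the nearest-peak search used in TextTiling).
    depths = []
    for i, s in enumerate(sims):
        left_peak = max(sims[: i + 1])
        right_peak = max(sims[i:])
        depths.append((left_peak - s) + (right_peak - s))

    # Cut wherever the depth exceeds mean + alpha * standard deviation.
    threshold = mean(depths) + alpha * pstdev(depths)
    boundaries = [i + 1 for i, d in enumerate(depths) if d > threshold]

    segments, start = [], 0
    for b in boundaries:
        segments.append(list(range(start, b)))
        start = b
    segments.append(list(range(start, n)))
    return segments


# Toy example: three frames of one "scene" followed by three of another.
toy = [[1, 0], [1, 0.05], [1, 0], [0, 1], [0.05, 1], [0, 1]]
print(segment_frames(toy))
```

In a recurrent-memory setting such as the one the abstract describes, each resulting segment would be processed in turn while a fixed number of memory tokens carries context forward, which is consistent with the linear memory scaling claimed above; that wiring is not shown here.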

MemBridge: Video-Language Pre-Training With Memory-Augmented Inter-Modality Bridge

ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst

BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning

TEVL: Trilinear Encoder for Video-language Representation Learning

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

Adaptively Building a Video-language Model for Video Captioning and Retrieval Without Massive Video Pretraining

Vision-language pre-training via modal interaction

VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges

Multimodal interaction enhanced representation learning for video emotion recognition

CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training

LXMERT: Learning Cross-Modality Encoder Representations from Transformers

Multimodal Memory Modelling for Video Captioning

Adaptive Spatio-Temporal Graph Enhanced Vision-Language Representation for Video QA

Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models

VLP2MSA: Expanding Vision-Language Pre-Training to Multimodal Sentiment Analysis

i-Code: An Integrative and Composable Multimodal Learning Framework

Cross-modality interaction reasoning for enhancing vision-language pre-training in image-text retrieval

Clover: Towards A Unified Video-Language Alignment and Fusion Model

Jointly Modeling Embedding and Translation to Bridge Video and Language

VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending

Memory-Based Augmentation Network for Video Captioning