Abstract: Multi-modal large language models (MLLMs) have demonstrated considerable potential across various downstream tasks that require cross-domain knowledge. MLLMs capable of processing videos, known as Video-MLLMs, have attracted broad interest in video-language understanding. However, videos, especially long videos, contain more visual tokens than images, making them difficult for LLMs to process. Existing works either downsample visual features or extend the LLM context size, which risks losing high-resolution information or slowing down inference. To address these limitations, we apply cross-attention layers in the intermediate projector between the visual encoder and the large language model (LLM). As the naive cross-attention mechanism is insensitive to temporal order, we further introduce causal cross-attention masks (CCAMs) within the cross-attention layers. This Video-MLLM, named Video-CCAM, is trained in a straightforward two-stage fashion: feature alignment and visual instruction tuning. We develop several Video-CCAM models based on LLMs of different sizes (4B, 9B, and 14B). Video-CCAM proves to be a robust Video-MLLM that performs well on both short and long videos. On standard video benchmarks such as MVBench and VideoChatGPT-QA, Video-CCAM models rank 1st/2nd/3rd in MVBench and TGIF-QA, and 2nd/3rd/4th in MSVD-QA, MSRVTT-QA, and ActivityNet-QA. On benchmarks encompassing long videos, Video-CCAM models can be directly adapted to long video understanding and still achieve exceptional scores despite being trained solely with images and 16-frame videos. Using 96 frames (6$\times$ the number of frames used in training), Video-CCAM models rank 1st/2nd/3rd in VideoVista and 1st/2nd/4th in MLVU among all open-source Video-MLLMs, respectively. The code is publicly available at https://github.com/QQ-MM/Video-CCAM.
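To make the idea of a causal cross-attention mask concrete, the following is a minimal sketch and not the authors' implementation. It assumes a projector with a fixed number of ordered queries, where query i may attend only to the visual tokens of the first few frames, with the visible range growing in proportion to the query index. The function names, the proportional frame-allocation rule, and the uniform number of tokens per frame are all illustrative assumptions; the abstract only states that the mask makes cross-attention sensitive to temporal order.

```python
import math


def causal_cross_attention_mask(num_queries, num_frames, tokens_per_frame):
    """Build a boolean mask of shape [num_queries][num_frames * tokens_per_frame].

    True means the query is allowed to attend to that visual token.
    Assumption (hypothetical rule): query i sees frames
    0 .. ceil((i + 1) * num_frames / num_queries) - 1.
    """
    total_tokens = num_frames * tokens_per_frame
    mask = []
    for i in range(num_queries):
        # Integer ceiling division avoids floating-point rounding.
        visible_frames = ((i + 1) * num_frames + num_queries - 1) // num_queries
        visible_tokens = visible_frames * tokens_per_frame
        mask.append([j < visible_tokens for j in range(total_tokens)])
    return mask


def masked_softmax(scores, allowed):
    """Softmax over scores, with disallowed positions forced to zero weight."""
    masked = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    peak = max(masked)  # every row keeps at least one allowed token
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # 4 queries over 2 frames with 3 visual tokens each.
    mask = causal_cross_attention_mask(num_queries=4, num_frames=2, tokens_per_frame=3)
    for row in mask:
        print("".join("1" if ok else "0" for ok in row))
    # Attention weights of the first query for uniform raw scores.
    print(masked_softmax([0.0] * 6, mask[0]))
```

Because the mask depends only on the ratio of query index to frame count, the same rule can be evaluated for a different number of frames at inference time (for example 96 instead of 16) without changing the number of queries, which is one plausible reading of why such a projector adapts to longer videos.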

XlanV Model with Adaptively Multi-Modality Feature Fusing for Video Captioning

Semantic Tag Augmented XlanV Model for Video Captioning

Vatex Video Captioning Challenge 2020: Multi-View Features and Hybrid Reward Strategies for Video Captioning

Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval

Self Attention Re-encoding and Linguistic Ability Preserving for Context-Aware Video Captioning

Adaptively Building a Video-language Model for Video Captioning and Retrieval Without Massive Video Pretraining

Fusion of Multi-Modal Features to Enhance Dense Video Caption

Multi-scale features with temporal information guidance for video captioning

VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending

Augmented Partial Mutual Learning with Frame Masking for Video Captioning

Multi-modal Feature Fusion with Feature Attention for VATEX Captioning Challenge 2020

MFVC: Urban Traffic Scene Video Caption Based on Multimodal Fusion

Measuring apoptosis in neural stem cells.

Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning

A Transformer-based Cross-modal Fusion Model with Adversarial Training for VQA Challenge 2021

Fused GRU with Semantic-Temporal Attention for Video Captioning.

Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos

Exploring the Role of Audio in Video Captioning

VLP2MSA: Expanding Vision-Language Pre-Training to Multimodal Sentiment Analysis

Efficient Transfer Learning for Video-language Foundation Models

Video Captioning with Aggregated Features Based on Dual Graphs and Gated Fusion