Abstract: Multi-modal large language models (MLLMs) have demonstrated considerable potential across various downstream tasks that require cross-domain knowledge. MLLMs capable of processing videos, known as Video-MLLMs, have attracted broad interest in video-language understanding. However, videos, especially long videos, contain more visual tokens than images, making them difficult for LLMs to process. Existing works either downsample visual features or extend the LLM context size, risking the loss of high-resolution information or slowing down inference. To address these limitations, we apply cross-attention layers in the intermediate projector between the visual encoder and the large language model (LLM). As the naive cross-attention mechanism is insensitive to temporal order, we further introduce causal cross-attention masks (CCAMs) within the cross-attention layers. This Video-MLLM, named Video-CCAM, is trained in a straightforward two-stage fashion: feature alignment and visual instruction tuning. We develop several Video-CCAM models based on LLMs of different sizes (4B, 9B, and 14B). Video-CCAM proves to be a robust Video-MLLM and performs strongly on both short and long videos. On standard video benchmarks like MVBench and VideoChatGPT-QA, Video-CCAM models rank 1st/2nd/3rd in MVBench and TGIF-QA, and 2nd/3rd/4th in MSVD-QA, MSRVTT-QA, and ActivityNet-QA. In benchmarks encompassing long videos, Video-CCAM models can be directly adapted to long video understanding and still achieve exceptional scores despite being trained solely with images and 16-frame videos. Using 96 frames (6$\times$ the training number of frames), Video-CCAM models rank 1st/2nd/3rd in VideoVista and 1st/2nd/4th in MLVU among all open-source Video-MLLMs, respectively. The code is publicly available at \url{https://github.com/QQ-MM/Video-CCAM}.
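To make the idea of a causal cross-attention mask concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify how the mask is built, so the rule used here is an assumption: each learnable query is tied to a frame index and may attend only to visual tokens from that frame and earlier ones. The function names `causal_cross_attention_mask` and `masked_cross_attention`, and the query-to-frame assignment formula, are hypothetical.

```python
import math


def causal_cross_attention_mask(num_queries, num_frames, tokens_per_frame):
    """Build mask[q][k]: True if query q may attend to visual token k.

    Assumed rule (not taken from the paper): query q sees all tokens of
    frames 0..last_frame, where last_frame grows with q and the final
    query sees every frame.
    """
    num_tokens = num_frames * tokens_per_frame
    mask = []
    for q in range(num_queries):
        last_frame = ((q + 1) * num_frames - 1) // num_queries
        visible = (last_frame + 1) * tokens_per_frame
        mask.append([k < visible for k in range(num_tokens)])
    return mask


def masked_cross_attention(queries, keys, values, mask):
    """Single-head scaled dot-product cross-attention with a boolean mask."""
    dim = len(queries[0])
    outputs = []
    for q_vec, row in zip(queries, mask):
        # Masked positions get -inf so their softmax weight is exactly zero.
        scores = [
            sum(a * b for a, b in zip(q_vec, k_vec)) / math.sqrt(dim)
            if allowed else float("-inf")
            for k_vec, allowed in zip(keys, row)
        ]
        peak = max(scores)  # finite: every query sees at least frame 0
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))
        ])
    return outputs


# Toy example: 4 queries, 4 frames, 2 visual tokens per frame.
mask = causal_cross_attention_mask(num_queries=4, num_frames=4, tokens_per_frame=2)
keys = [[float(i), 1.0] for i in range(8)]
values = [[float(i)] for i in range(8)]
queries = [[0.1, 0.2] for _ in range(4)]
out = masked_cross_attention(queries, keys, values, mask)
```

Under this assumed rule, changing the visual tokens of later frames leaves the outputs of earlier queries unchanged, which is what makes the projector sensitive to temporal order. Because the mask depends only on the frame count passed in, the same function can be called with more frames at inference than were used in training.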

Dynamic Contrastive Learning with Pseudo-samples Intervention for Weakly Supervised Joint Video MR and HD

Point Contrastive Prediction with Semantic Clustering for Self-Supervised Learning on Point Cloud Videos

Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning

Temporal Self-Paced Proposal Learning for Weakly-Supervised Video Moment Retrieval and Highlight Detection

Weakly Supervised Video Moment Localization with Contrastive Negative Sample Mining

PointCMP: Contrastive Mask Prediction for Self-supervised Learning on Point Cloud Videos

VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval

Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos

Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations

Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection

MCT-VHD: Multi-modal contrastive transformer for video highlight detection

Video Contrastive Learning with Global Context

Contrastive Video-Language Learning with Fine-grained Frame Sampling

Masking-Based Cross-Modal Remote Sensing Image–Text Retrieval via Dynamic Contrastive Learning

Self-Supervised Visual Representations Learning by Contrastive Mask Prediction

Masked Contrastive Pre-Training for Efficient Video-Text Retrieval

Few-Shot MS and PAN Joint Classification with Improved Cross-Source Contrastive Learning

Enhancing Contrastive Learning with Efficient Combinatorial Positive Pairing

Motion Sensitive Contrastive Learning for Self-supervised Video Representation

X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval

Masked Collaborative Contrast for Weakly Supervised Semantic Segmentation