Abstract: Multi-modal large language models (MLLMs) have demonstrated considerable potential across downstream tasks that require cross-domain knowledge. MLLMs capable of processing videos, known as Video-MLLMs, have attracted broad interest in video-language understanding. However, videos, especially long ones, contain more visual tokens than images, making them difficult for LLMs to process. Existing works either downsample visual features or extend the LLM context size, risking the loss of high-resolution information or slower inference. To address these limitations, we apply cross-attention layers in the intermediate projector between the visual encoder and the large language model (LLM). Because the naive cross-attention mechanism is insensitive to temporal order, we further introduce causal cross-attention masks (CCAMs) within the cross-attention layers. The resulting Video-MLLM, named Video-CCAM, is trained in a straightforward two-stage fashion: feature alignment followed by visual instruction tuning. We develop several Video-CCAM models based on LLMs of different sizes (4B, 9B, and 14B). Video-CCAM proves to be a robust Video-MLLM and performs strongly on both short and long videos. On standard video benchmarks such as MVBench and VideoChatGPT-QA, Video-CCAM models rank 1st/2nd/3rd on MVBench and TGIF-QA and 2nd/3rd/4th on MSVD-QA, MSRVTT-QA, and ActivityNet-QA. On benchmarks encompassing long videos, Video-CCAM models can be applied directly to long video understanding and still achieve high scores despite being trained solely on images and 16-frame videos. Using 96 frames (6$\times$ the number of training frames), Video-CCAM models rank 1st/2nd/3rd on VideoVista and 1st/2nd/4th on MLVU, respectively, among all open-source Video-MLLMs. The code is publicly available at \url{https://github.com/QQ-MM/Video-CCAM}.
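
To make the idea of a causal cross-attention mask concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the exact masking rule, so the rule used here is an assumption: the projector's queries are ordered in time, and each query may attend only to visual tokens from frames up to the end of its own time slice. The function name `causal_cross_attention_mask` and its parameters are hypothetical.

```python
def causal_cross_attention_mask(num_queries, num_frames, tokens_per_frame):
    """Build a boolean mask with num_queries rows and
    num_frames * tokens_per_frame columns.

    mask[q][k] is True when query q may attend to visual token k.
    ASSUMPTION (not taken from the paper): queries are spread evenly over
    time, and query q sees every frame up to the end of its time slice.
    """
    num_keys = num_frames * tokens_per_frame
    mask = []
    for q in range(num_queries):
        # Number of leading frames visible to query q, rounded up so that
        # even the first query always sees at least one frame.
        visible_frames = ((q + 1) * num_frames + num_queries - 1) // num_queries
        # Token k belongs to frame k // tokens_per_frame.
        row = [(k // tokens_per_frame) < visible_frames for k in range(num_keys)]
        mask.append(row)
    return mask


# Example: 4 queries, 4 frames, 2 visual tokens per frame.
mask = causal_cross_attention_mask(4, 4, 2)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Under this assumed rule the visible range depends only on the ratio of query index to query count, so the same function yields a valid mask for any number of frames, and the final query always sees every frame. In an attention layer, the positions marked False would typically be set to negative infinity in the score matrix before the softmax.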

Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos
