Abstract: Multi-modal large language models (MLLMs) have demonstrated considerable potential across various downstream tasks that require cross-domain knowledge. MLLMs capable of processing videos, known as Video-MLLMs, have attracted broad interest in video-language understanding. However, videos, especially long videos, contain more visual tokens than images, making them difficult for LLMs to process. Existing works either downsample visual features or extend the LLM context size, risking the loss of high-resolution information or slower inference. To address these limitations, we apply cross-attention layers in the intermediate projector between the visual encoder and the large language model (LLM). As the naive cross-attention mechanism is insensitive to temporal order, we further introduce causal cross-attention masks (CCAMs) within the cross-attention layers. This Video-MLLM, named Video-CCAM, is trained in a straightforward two-stage fashion: feature alignment and visual instruction tuning. We develop several Video-CCAM models based on LLMs of different sizes (4B, 9B, and 14B). Video-CCAM proves to be a robust Video-MLLM that performs well on both short and long videos. On standard video benchmarks such as MVBench and VideoChatGPT-QA, the Video-CCAM models rank 1st/2nd/3rd on MVBench and TGIF-QA and 2nd/3rd/4th on MSVD-QA, MSRVTT-QA, and ActivityNet-QA. On benchmarks encompassing long videos, Video-CCAM models can be directly adapted to long video understanding and still achieve strong scores despite being trained solely with images and 16-frame videos. Using 96 frames (6$\times$ the number of training frames), Video-CCAM models rank 1st/2nd/3rd on VideoVista and 1st/2nd/4th on MLVU among all open-source Video-MLLMs. The code is publicly available at https://github.com/QQ-MM/Video-CCAM.
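
The abstract says only that causal masks are added inside the projector's cross-attention layers to make them sensitive to temporal order. It does not give the mask layout. The following is a minimal sketch of one way such a mask could be built, under two assumptions that do not come from the abstract: the projector queries are spread evenly over the frames, and each query may attend to all visual tokens of its own frame and earlier frames but none from later frames. The function name and parameters are hypothetical and are not taken from the Video-CCAM code.

```python
def causal_cross_attention_mask(num_queries, num_frames, tokens_per_frame):
    """Build a boolean mask with num_queries rows and
    num_frames * tokens_per_frame columns.

    mask[q][k] is True when query q may attend to visual token k.
    Assumption (not from the abstract): queries are split evenly over
    frames, and query q sees frames 0..last_frame only.
    """
    total_tokens = num_frames * tokens_per_frame
    mask = []
    for q in range(num_queries):
        # Last frame visible to query q under the assumed even split.
        last_frame = ((q + 1) * num_frames - 1) // num_queries
        # Token k belongs to frame k // tokens_per_frame.
        row = [(k // tokens_per_frame) <= last_frame for k in range(total_tokens)]
        mask.append(row)
    return mask


# Example: 4 queries over 2 frames of 3 tokens each.
mask = causal_cross_attention_mask(num_queries=4, num_frames=2, tokens_per_frame=3)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

In an attention layer, positions where the mask is False would typically have their scores set to negative infinity before the softmax, so later frames contribute nothing to earlier queries. Because the mask depends only on the frame count, it can be rebuilt for more frames at inference time than were used in training.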

Context-aware focal alignment network for micro-video multi-label classification

Deep Dual-Stream Network with Scale Context Selection Attention Module for Semantic Segmentation

Multilevel Spatial-Temporal Feature Aggregation for Video Object Detection

Dual-domain Aligned Deep Hierarchical Matrix Factorization Method for Micro-video Multi-label Classification

Fine-grained Cross-modal Alignment Network for Text-Video Retrieval

Context Attention Fusion Network for crowd counting

Joint learning of video scene detection and annotation via multi-modal adaptive context network

Context-aware network with foreground recalibration for grounding natural language in video

Multimodal Attentive Representation Learning for Micro-video Multi-label Classification

Group Contextualization for Video Recognition

User-Video Co-Attention Network for Personalized Micro-video Recommendation

Multi-attention Networks for Temporal Localization of Video-level Labels

Attention in Attention: Modeling Context Correlation for Efficient Video Classification

Spatial Context-Aware Object-Attentional Network for Multi-Label Image Classification

Multi-Granularity Context Network for Efficient Video Semantic Segmentation

Fine-grained Iterative Attention Network for Temporal Language Localization in Videos

Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition

Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos

Fusing Multi-Stream Deep Networks for Video Classification

A Short Video Classification Framework Based on Cross-Modal Fusion