Abstract: Multi-modal large language models (MLLMs) have demonstrated considerable potential across various downstream tasks that require cross-domain knowledge. MLLMs capable of processing videos, known as Video-MLLMs, have attracted broad interest in video-language understanding. However, videos, especially long videos, contain far more visual tokens than images, making them difficult for LLMs to process. Existing works either downsample visual features, which risks losing high-resolution information, or extend the LLM context size, which slows down inference. To address these limitations, we apply cross-attention layers in the intermediate projector between the visual encoder and the large language model (LLM). As the naive cross-attention mechanism is insensitive to temporal order, we further introduce causal cross-attention masks (CCAMs) within the cross-attention layers. This Video-MLLM, named Video-CCAM, is trained in a straightforward two-stage fashion: feature alignment and visual instruction tuning. We develop several Video-CCAM models based on LLMs of different sizes (4B, 9B, and 14B). Video-CCAM proves to be a robust Video-MLLM that performs well on both short and long videos. On standard video benchmarks such as MVBench and VideoChatGPT-QA, the Video-CCAM models rank 1st/2nd/3rd in MVBench and TGIF-QA and 2nd/3rd/4th in MSVD-QA, MSRVTT-QA, and ActivityNet-QA. On benchmarks encompassing long videos, Video-CCAM models can be directly adapted to long video understanding and still achieve strong scores despite being trained solely with images and 16-frame videos. Using 96 frames (6$\times$ the number of training frames), Video-CCAM models rank 1st/2nd/3rd in VideoVista and 1st/2nd/4th in MLVU among all open-source Video-MLLMs, respectively. The code is publicly available at \url{https://github.com/QQ-MM/Video-CCAM}.
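The abstract does not spell out how a causal cross-attention mask is laid out, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that the projector holds an ordered set of queries and that query q may attend to every visual token of a prefix of the frames that grows with q. The function names, the prefix rule, and the toy sizes are all hypothetical.

```python
import math

def causal_cross_attention_mask(num_queries, num_frames, tokens_per_frame):
    """Return mask[q][k] == True where query q may attend to visual token k.

    Assumption (not stated in the abstract): query q sees all tokens of
    frames 0 .. ceil((q + 1) * num_frames / num_queries) - 1, so later
    queries see a longer prefix of the video.
    """
    num_tokens = num_frames * tokens_per_frame
    mask = []
    for q in range(num_queries):
        # Exclusive index of the last frame visible to query q (always >= 1).
        last_frame = math.ceil((q + 1) * num_frames / num_queries)
        visible = last_frame * tokens_per_frame
        mask.append([k < visible for k in range(num_tokens)])
    return mask

def masked_softmax(scores, allowed):
    """Softmax over one row of attention scores; masked positions get weight 0."""
    peak = max(s for s, a in zip(scores, allowed) if a)  # for numerical stability
    exps = [math.exp(s - peak) if a else 0.0 for s, a in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: 4 queries, 2 frames, 3 visual tokens per frame.
mask = causal_cross_attention_mask(num_queries=4, num_frames=2, tokens_per_frame=3)
# With uniform scores, the first query spreads its weight over frame 0 only.
weights = masked_softmax([0.0] * 6, mask[0])
```

Because the mask in this sketch depends only on the ratio of frames to queries, the same function can be called with a larger `num_frames` at inference time (for example 96 instead of 16) without changing the number of queries, which is in the spirit of the long-video adaptation the abstract describes.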

Class-attention Video Transformer for Engagement Intensity Prediction

Class-attention video transformer for engagement prediction

Engagement Detection in Online Learning Based on Pre-trained Vision Transformer and Temporal Convolutional Network

Long-term Leap Attention, Short-term Periodic Shift for Video Classification

Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos

Advanced Multi-Instance Learning Method with Multi-features Engineering and Conservative Optimization for Engagement Intensity Prediction

Transformer-Based Interactive Multi-Modal Attention Network for Video Sentiment Detection

Attention in Attention: Modeling Context Correlation for Efficient Video Classification

Is Space-Time Attention All You Need for Video Understanding?

Attention is all you need for Videos: Self-attention based Video Summarization using Universal Transformers

An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos

HaViT: Hybrid-Attention Based Vision Transformer for Video Classification

A Video Classification Method Based on Spatiotemporal Detail Attention and Feature Fusion

A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval

Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification

Deep Recurrent Multi-instance Learning with Spatio-temporal Features for Engagement Intensity Prediction

A Short Video Classification Framework Based on Cross-Modal Fusion

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

TRecViT: A Recurrent Video Transformer

MEViT: Motion Enhanced Video Transformer for Video Classification