Abstract: Multi-modal large language models (MLLMs) have demonstrated considerable potential across various downstream tasks that require cross-domain knowledge. MLLMs capable of processing videos, known as Video-MLLMs, have attracted broad interest in video-language understanding. However, videos, especially long videos, contain more visual tokens than images, making them difficult for LLMs to process. Existing works either downsample visual features or extend the LLM context size, risking the loss of high-resolution information or slowing down inference. To address these limitations, we apply cross-attention layers in the intermediate projector between the visual encoder and the large language model (LLM). As the naive cross-attention mechanism is insensitive to temporal order, we further introduce causal cross-attention masks (CCAMs) within the cross-attention layers. This Video-MLLM, named Video-CCAM, is trained in a straightforward two-stage fashion: feature alignment and visual instruction tuning. We develop several Video-CCAM models based on LLMs of different sizes (4B, 9B, and 14B). Video-CCAM proves to be a robust Video-MLLM and performs strongly on both short and long videos. On standard video benchmarks such as MVBench and VideoChatGPT-QA, the Video-CCAM models rank 1st/2nd/3rd in MVBench and TGIF-QA and 2nd/3rd/4th in MSVD-QA, MSRVTT-QA, and ActivityNet-QA. On benchmarks with long videos, Video-CCAM models can be applied directly and still achieve high scores despite being trained solely on images and 16-frame videos. Using 96 frames (6$\times$ the number of training frames), Video-CCAM models rank 1st/2nd/3rd in VideoVista and 1st/2nd/4th in MLVU, respectively, among all open-source Video-MLLMs. The code is publicly available at \url{https://github.com/QQ-MM/Video-CCAM}.
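
To make the idea of a causal cross-attention mask concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the mask layout, so the sketch assumes that a fixed set of projector queries is ordered in time and that query `q` may attend only to the visual tokens of a leading prefix of frames. The function names and the prefix rule are hypothetical and chosen for illustration only.

```python
import math

def causal_cross_attention_mask(n_queries, n_frames, tokens_per_frame):
    """Return mask[q][k] = True where query q may attend to visual token k.

    Assumption (not from the abstract): query q sees the first
    ceil((q + 1) * n_frames / n_queries) frames, so later queries see more.
    """
    mask = []
    for q in range(n_queries):
        # Integer ceiling; every query sees at least one frame.
        visible_frames = -(-(q + 1) * n_frames // n_queries)
        visible_tokens = visible_frames * tokens_per_frame
        mask.append([k < visible_tokens
                     for k in range(n_frames * tokens_per_frame)])
    return mask

def masked_softmax(scores, allowed):
    """Softmax over one row of scores, with zero weight on masked keys."""
    peak = max(s for s, a in zip(scores, allowed) if a)
    exps = [math.exp(s - peak) if a else 0.0
            for s, a in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

# 4 queries, 4 frames, 2 visual tokens per frame.
mask = causal_cross_attention_mask(n_queries=4, n_frames=4, tokens_per_frame=2)
# With uniform scores, query 1 spreads its attention over frames 0 and 1 only.
weights = masked_softmax([0.0] * 8, mask[1])
```

Under this assumed rule the mask depends only on the ratio of frames to queries, so the same construction applies unchanged when the number of frames at inference (for example 96) differs from the number used in training (16).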

Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos
