Abstract: Multi-modal large language models (MLLMs) have demonstrated considerable potential across various downstream tasks that require cross-domain knowledge. MLLMs capable of processing videos, known as Video-MLLMs, have attracted broad interest in video-language understanding. However, videos, especially long videos, contain more visual tokens than images, making them difficult for LLMs to process. Existing works either downsample visual features or extend the LLM context size, risking the loss of high-resolution information or slower inference. To address these limitations, we apply cross-attention layers in the intermediate projector between the visual encoder and the large language model (LLM). As the naive cross-attention mechanism is insensitive to temporal order, we further introduce causal cross-attention masks (CCAMs) within the cross-attention layers. This Video-MLLM, named Video-CCAM, is trained in a straightforward two-stage fashion: feature alignment and visual instruction tuning. We develop several Video-CCAM models based on LLMs of different sizes (4B, 9B, and 14B). Video-CCAM proves to be a robust Video-MLLM and performs strongly on both short and long videos. On standard video benchmarks such as MVBench and VideoChatGPT-QA, Video-CCAM models rank 1st/2nd/3rd in MVBench and TGIF-QA, and 2nd/3rd/4th in MSVD-QA, MSRVTT-QA, and ActivityNet-QA. On benchmarks encompassing long videos, Video-CCAM models can be directly adapted to long video understanding and still achieve exceptional scores despite being trained solely with images and 16-frame videos. Using 96 frames (6$\times$ the number of training frames), Video-CCAM models rank 1st/2nd/3rd in VideoVista and 1st/2nd/4th in MLVU among all open-source Video-MLLMs, respectively. The code is publicly available at https://github.com/QQ-MM/Video-CCAM.
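The masking idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names are hypothetical, the rule that spreads queries evenly over frames is an assumption made for illustration, and the raw visual tokens stand in for learned key/value projections. It shows two properties the abstract relies on: a query only attends to visual tokens from frames up to its own position in time, and the number of output tokens is fixed by the number of queries rather than by the number of frames.

```python
import numpy as np


def causal_cross_attention_mask(num_queries, num_frames, tokens_per_frame):
    """Boolean mask of shape (num_queries, num_frames * tokens_per_frame).

    True means the query may attend to that visual token.
    ASSUMPTION: queries are spread evenly over the frames; query q sees
    frames 0 .. ceil((q + 1) * num_frames / num_queries) - 1.
    """
    q = np.arange(num_queries)
    # Integer ceiling division, then convert to a 0-based last visible frame.
    query_frame = ((q + 1) * num_frames + num_queries - 1) // num_queries - 1
    # Frame index of every visual token (tokens are laid out frame by frame).
    key_frame = np.repeat(np.arange(num_frames), tokens_per_frame)
    return key_frame[None, :] <= query_frame[:, None]


def masked_cross_attention(queries, keys, values, mask):
    """Single-head cross-attention where masked-out positions get zero weight."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    # Every query sees at least frame 0, so each row has a finite maximum.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values


# Illustrative sizes only; none of these numbers come from the paper.
rng = np.random.default_rng(0)
num_queries, tokens_per_frame, dim = 64, 4, 32
queries = rng.standard_normal((num_queries, dim))
for num_frames in (16, 96):
    visual = rng.standard_normal((num_frames * tokens_per_frame, dim))
    mask = causal_cross_attention_mask(num_queries, num_frames, tokens_per_frame)
    out = masked_cross_attention(queries, visual, visual, mask)
    print(num_frames, out.shape)
```

In this sketch the output has one row per query for both the 16-frame and the 96-frame input, which is why a projector of this kind can pass more frames to the LLM without lengthening its visual context.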

What problem does this paper attempt to address?

### The problems the paper attempts to solve

The paper "Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos" aims to solve several key problems in video-language understanding:

1. **Handling a large number of visual tokens**:
   - Videos have an additional temporal dimension compared to images, so the number of visual tokens is proportional not only to the spatial resolution but also to the number of video frames. This makes it difficult for existing large language models (LLMs) to handle the visual tokens within a limited context length.
   - Existing methods usually address this by downsampling visual features or expanding the context size of the LLM, but these methods either lose high-resolution information or slow down inference.
2. **Maintaining temporal order**:
   - The plain cross-attention mechanism is insensitive to temporal order: each query can attend indiscriminately to all spatio-temporal visual tokens. This is a major drawback for video understanding, because the order of events is crucial for interpreting the content accurately.
3. **Handling long videos**:
   - Existing video multi-modal large language models (Video-MLLMs) perform poorly on long videos because they are mainly trained on short videos. A method that can adapt directly to long-video understanding is therefore required.

### Solutions

To address the above challenges, the authors propose the **Video-CCAM** model, whose main innovations include:

1. **Causal Cross-Attention Masks (CCAMs)**:
   - Causal cross-attention masks are introduced in the cross-attention layers to give the queries a temporal order, thereby enhancing the model's ability to understand videos.
   - Specifically, CCAMs ensure that each query can only attend to its current and previous frames, thus preserving the temporal order of the video.
2. **Flexible projector structure**:
   - The projector uses a fixed number of queries to handle videos with different numbers of frames, so it can process a large number of video frames without exceeding the context length.
   - By reducing the number of layers and increasing the number of queries, the projector structure is simplified, making it easier to train.
3. **Two-stage training strategy**:
   - First stage: the randomly initialized CCAM projector is connected between the pre-trained visual encoder and the LLM and pre-trained on image-text data.
   - Second stage: image-text and video-text pairs are used for visual instruction tuning to further improve model performance.

### Experimental results

The authors evaluated Video-CCAM on multiple benchmarks, including MVBench, VideoVista, MLVU, VideoChatGPT-QA, and Video-MME. The experimental results show:

- **MVBench**: Video-CCAM-4B outperforms all previous multi-modal large language models (MLLMs) on multiple tasks, and Video-CCAM-9B achieves new best results.
- **VideoVista**: Video-CCAM-4B performs best among all open-source Video-MLLMs, and Video-CCAM-14B achieves new best results among open-source Video-MLLMs, with performance close to GPT-4o-mini and Gemini 1.5 Flash.
- **MLVU**: Despite the challenges of handling long videos, Video-CCAM still performs well, especially in generation tasks.
- **VideoChatGPT-QA**: Video-CCAM-4B outperforms all previous works on multiple subtasks, and Video-CCAM-14B further narrows the gap with large-scale models.
- **Video-MME**: Video-CCAM-4B is competitive, only slightly inferior to the larger InternVL-Chat-V1.5, and Video-CCAM-14B ranks highest among all open-source MLLMs with fewer than 30B parameters.

Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos
