Abstract: We present xGen-MM-Vid (BLIP-3-Video): a multimodal language model for videos, particularly designed to efficiently capture temporal information over multiple frames. BLIP-3-Video uses a 'temporal encoder' in addition to the conventional visual tokenizer, which maps a sequence of tokens over multiple frames into a compact set of visual tokens. This enables BLIP-3-Video to use far fewer visual tokens than its competing models (e.g., 32 vs. 4608 tokens). We explore different types of temporal encoders, including learnable spatio-temporal pooling as well as sequential models like Token Turing Machines. We experimentally confirm that BLIP-3-Video obtains video question-answering accuracies comparable to much larger state-of-the-art models (e.g., 34B), while being much smaller (i.e., 4B) and more efficient by using fewer visual tokens. The project website is at https://www.salesforceairesearch.com/opensource/xGen-MM-Vid/index.html
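To make the token counts in the abstract concrete, here is a small arithmetic sketch. The 32 and 4608 figures come from the abstract; the 8-frame clip length and the even split of tokens across frames are assumptions used only for illustration, and the helper name `compression_ratio` is hypothetical.

```python
def compression_ratio(baseline_tokens: int, compact_tokens: int) -> float:
    """Return how many times fewer tokens the compact representation uses."""
    return baseline_tokens / compact_tokens


baseline_tokens = 4608  # competing-model figure quoted in the abstract
compact_tokens = 32     # BLIP-3-Video figure quoted in the abstract

ratio = compression_ratio(baseline_tokens, compact_tokens)
print(ratio)  # 144.0

# Assumption (not stated in the abstract): the baseline spends its tokens
# evenly over an 8-frame clip.
assumed_frames = 8
per_frame = baseline_tokens // assumed_frames
print(per_frame)  # 576
```

Under these assumptions the compact representation is 144 times shorter, which matters because the language model has to attend over every visual token it receives.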

What problem does this paper attempt to address?

This paper addresses two main challenges that video multimodal language models (VLMs) face when processing videos:

1. **Efficient compression of visual information**: Conventional pipelines convert each frame into a large number of visual tokens, which sharply increases computational cost and memory consumption. For an 8-frame video, for example, some existing models need thousands of visual tokens to represent the whole clip. Such a long token sequence increases the computational burden and can make training and inference inefficient.
2. **Effective capture of temporal information**: Unlike a static image, a video changes over time. The model therefore has to capture and abstract temporal information well enough to understand the actions and events in the video and the order in which they occur.

To address these two problems, the paper proposes a new model, **xGen-MM-Vid (BLIP-3-Video)**. By introducing an explicit **temporal encoder**, the model retains its ability to understand video content while significantly reducing the number of visual tokens. Specifically, the temporal encoder compresses the visual tokens of multiple frames into a small number of video-level tokens (for example, 32 tokens), which makes the model considerably more efficient to run.

### Main contributions

1. **Efficient visual token compression**: With the temporal encoder, BLIP-3-Video can represent a video with only 32 visual tokens, far fewer than competing models use (for example, 4608 tokens).
2. **Strong ability to capture temporal information**: The temporal encoder captures the temporal dependencies and dynamic changes in the video, so the model performs well on tasks such as video question answering and caption generation.
3. **Experimental verification**: The experimental results show that BLIP-3-Video performs as well as or even better than larger models on multiple public datasets, while having fewer parameters (4B vs. 34B) and higher computational efficiency.

In conclusion, by introducing a temporal encoder, the paper provides a more efficient and compact way to process video data, addressing both the computational cost and the temporal-information capture problems that existing models face.
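The idea of compressing per-frame tokens into a few video-level tokens can be sketched with a generic attention pooling layer. This is a minimal NumPy illustration, not the paper's implementation: the function name `attention_pool`, the 128 tokens per frame, the 64-dimensional features, and the random stand-ins for learned parameters are all assumptions. Only the 8 frames and the 32 output tokens echo numbers from the text above.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_pool(frame_tokens, queries, w_k, w_v):
    """Pool tokens from all frames into len(queries) video-level tokens.

    frame_tokens: (T, N, D) array, N tokens of width D for each of T frames.
    queries:      (M, D) array, one query per output token.
    w_k, w_v:     (D, D) key and value projections.
    Returns an (M, D) array.
    """
    t, n, d = frame_tokens.shape
    tokens = frame_tokens.reshape(t * n, d)      # flatten space and time
    keys = tokens @ w_k
    values = tokens @ w_v
    scores = queries @ keys.T / np.sqrt(d)       # (M, T*N) similarities
    weights = softmax(scores, axis=-1)           # each query attends over all tokens
    return weights @ values                      # weighted sum per query


rng = np.random.default_rng(0)
T, N, D, M = 8, 128, 64, 32                      # N and D are illustrative assumptions
frames = rng.standard_normal((T, N, D))
# Random stand-ins for parameters that would be learned during training.
queries = rng.standard_normal((M, D))
w_k = rng.standard_normal((D, D)) / np.sqrt(D)
w_v = rng.standard_normal((D, D)) / np.sqrt(D)

video_tokens = attention_pool(frames, queries, w_k, w_v)
print(video_tokens.shape)  # (32, 64)
```

Because each query attends over the tokens of every frame at once, the output size depends only on the number of queries, not on the number of frames or tokens per frame. A sequential encoder such as the Token Turing Machines mentioned in the abstract would instead process frames one after another; that variant is not shown here.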

xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs

xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models

Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding

Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM

MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens

Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

Number it: Temporal Grounding Videos like Flipping Manga

VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation

Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos

The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval

Long Context Transfer from Language to Vision

VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving

VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges

BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions

Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos

MM-VID: Advancing Video Understanding with GPT-4V(ision)

T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs