xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs

Michael S. Ryoo, Honglu Zhou, Shrikant Kendre, Can Qin, Le Xue, Manli Shu, Silvio Savarese, Ran Xu, Caiming Xiong, Juan Carlos Niebles
2024-10-22
Abstract: We present xGen-MM-Vid (BLIP-3-Video): a multimodal language model for videos, particularly designed to efficiently capture temporal information over multiple frames. BLIP-3-Video takes advantage of the 'temporal encoder' in addition to the conventional visual tokenizer, which maps a sequence of tokens over multiple frames into a compact set of visual tokens. This enables BLIP-3-Video to use much fewer visual tokens than its competing models (e.g., 32 vs. 4608 tokens). We explore different types of temporal encoders, including learnable spatio-temporal pooling as well as sequential models like Token Turing Machines. We experimentally confirm that BLIP-3-Video obtains video question-answering accuracies comparable to much larger state-of-the-art models (e.g., 34B), while being much smaller (i.e., 4B) and more efficient by using fewer visual tokens. The project website is at https://www.salesforceairesearch.com/opensource/xGen-MM-Vid/index.html
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper addresses two main challenges that video multimodal language models (VLMs) face when processing videos:

1. **Efficient compression of visual information**: Conventional approaches convert each video frame into a large number of visual tokens, which sharply increases computational cost and memory consumption. For example, for an 8-frame video, some existing models may require thousands of visual tokens to represent the entire video. Such a large representation not only increases the computational burden but may also make training and inference inefficient.
2. **Effective capture of temporal information**: Unlike a static image, a video contains dynamic changes along the time dimension. Capturing and abstracting this temporal information, so that the model understands the actions and events in the video and the order in which they occur, is therefore another key challenge.

To address these two problems, the paper proposes a new model, **xGen-MM-Vid (BLIP-3-Video)**. By introducing an explicit **temporal encoder**, the model retains its ability to understand video content while significantly reducing the number of visual tokens. Specifically, the temporal encoder in BLIP-3-Video compresses the visual tokens of multiple frames into a small number of video-level tokens (for example, 32 tokens), which substantially improves computational efficiency while keeping accuracy competitive.

### Main contributions

1. **Efficient visual token compression**: Using the temporal encoder, BLIP-3-Video can represent a video with only 32 visual tokens, far fewer than competing models use (for example, 4608 tokens).
2. **Strong ability to capture temporal information**: The temporal encoder captures the temporal dependencies and dynamic changes in the video, so the model performs well on tasks such as video question answering and caption generation.
3. **Experimental verification**: The experimental results show that BLIP-3-Video performs as well as or even better than larger-scale models on multiple public datasets, while having fewer parameters (4B vs. 34B) and higher computational efficiency.

In conclusion, by introducing a temporal encoder, this paper provides a more efficient and compact way to process video data, addressing both the computational cost and the temporal information capture problems that existing models face when processing videos.
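
To make the token compression concrete, here is a minimal NumPy sketch of one way a set of per-frame tokens could be pooled into 32 video-level tokens. It is an illustration under stated assumptions, not the paper's implementation: it uses a generic query-based attention pooling as a stand-in for the "learnable spatio-temporal pooling" mentioned in the abstract, the function name `attention_pool` is hypothetical, and the per-frame token count (128) and feature width (64) are assumed values chosen only for the example. The queries are random here; in a trained model they would be learned parameters.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_pool(frame_tokens, queries):
    """Compress (T, N, D) per-frame tokens into (K, D) video-level tokens.

    Hypothetical helper: each of the K queries attends over all T*N tokens
    from every frame and returns a weighted average of them.
    """
    t, n, d = frame_tokens.shape
    flat = frame_tokens.reshape(t * n, d)        # (T*N, D): all frames together
    scores = queries @ flat.T / np.sqrt(d)       # (K, T*N) similarity scores
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ flat                        # (K, D) pooled tokens


rng = np.random.default_rng(0)
T, N, D, K = 8, 128, 64, 32                      # N and D are assumed values
frame_tokens = rng.standard_normal((T, N, D))    # stand-in for visual tokenizer output
queries = rng.standard_normal((K, D))            # would be learned in practice

video_tokens = attention_pool(frame_tokens, queries)
print(video_tokens.shape)                        # (32, 64)

# Token budget quoted in the abstract: 4608 tokens vs. 32 tokens.
print(4608 / K)                                  # 144.0
```

With the figures quoted in the abstract, 32 tokens is a 144-fold reduction relative to 4608. Note that this sketch carries no positional information, so it is invariant to frame order; an encoder meant to capture temporal structure would need frame position embeddings or a sequential model such as the Token Turing Machines the abstract mentions.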