Xingjian He, Sihan Chen, Fan Ma, Zhicheng Huang, Xiaojie Jin, Zikang Liu, Dongmei Fu, Yi Yang, Jing Liu, Jiashi Feng
Abstract: Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations. However, there is limited research on learning video-text representations for general video multimodal tasks based on these powerful features. Towards this goal, we propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending, which transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks. Specifically, VLAB is founded on two key strategies: feature adapting and feature blending. In the former, we introduce a new video adapter module to address CLIP's deficiency in modeling temporal information and extend the model's capability to encompass both contrastive and generative tasks. In the latter, we propose an end-to-end training method that further enhances the model's performance by exploiting the complementarity of image and video features. We validate the effectiveness and versatility of VLAB through extensive experiments on highly competitive video multimodal tasks, including video text retrieval, video captioning, and video question answering. Remarkably, VLAB outperforms competing methods significantly and sets new records in video question answering on MSRVTT, MSVD, and TGIF datasets. It achieves an accuracy of 49.6, 61.0, and 79.0, respectively. Codes and models will be released.
What problem does this paper attempt to address?
The paper asks how to transfer the representations of powerful image-text pre-training models such as CLIP to video-text tasks, so that a single unified multimodal model can improve video-language understanding. Specifically, it improves CLIP's handling of video data through two strategies, feature adapting and feature blending, and develops a general-purpose model for multiple video-text tasks (video-text retrieval, video captioning, and video question answering).
### Specific Background of the Problem
1. **Domain Differences**: Image-text models such as CLIP are designed for static images, whereas video data has a temporal dimension, so applying these models directly to video tasks performs poorly.
2. **Task Differences**: CLIP is trained mainly for contrastive learning, whereas video-language models must handle both contrastive tasks and generative tasks (such as video captioning and video question answering).
3. **Data Differences**: Image-text pre-training models usually rely on large-scale image-text pair datasets, whereas high-quality video-text pairs are relatively scarce and more expensive to train on.
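To make the "contrastive task" in item 2 concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive loss over a batch of paired video and text embeddings. It is an illustration under stated assumptions, not the paper's implementation: the function names, the plain-list embeddings, and the temperature of 0.07 are all hypothetical choices made for this example.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Average of video-to-text and text-to-video cross-entropy.

    Pair i in video_embs is assumed to match pair i in text_embs;
    every other item in the batch acts as a negative.
    The temperature value is an assumption for illustration.
    """
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)
    # logits[i][j] = scaled cosine similarity between video i and text j
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Video-to-text: each row should put its mass on the diagonal entry.
    v2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-video: same computation over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2v = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)
```

A loss of this form only ranks matching pairs above mismatched ones; it does not produce text, which is why a contrastively trained model needs extra components before it can handle captioning or question answering.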
### Solutions in the Paper
To address these problems, the authors propose VLAB (Video Language pre-training by feature Adapting and Blending), a new video-language pre-training method. VLAB rests on two key strategies:
1. **Feature Adapting**:
- A new video adapter module compensates for CLIP's weakness in modeling temporal information, so that the model handles video data better.
- A multimodal encoder is added so that the model can also perform generative tasks.
2. **Feature Blending**:
- An end-to-end training method exploits the complementarity of image and video features to further improve model performance.
- A cross-attention mechanism fuses video features and image features, letting the model learn how to combine the two representations.
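The cross-attention fusion described in the last bullet can be sketched as follows. This is a simplified illustration, not the authors' architecture: it assumes video tokens act as queries and image tokens as keys and values, omits the learned query/key/value projections a real model would have, and uses a plain residual sum. The function names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_blend(video_tokens, image_tokens):
    """Blend image features into video features with scaled dot-product attention.

    Assumptions for illustration: video tokens are the queries, image tokens
    are both keys and values, and no learned projection matrices are applied.
    Both inputs are lists of equal-length feature vectors.
    """
    dim = len(video_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    blended = []
    for query in video_tokens:
        # Similarity of this video token to every image token.
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in image_tokens
        ]
        weights = softmax(scores)
        # Attention output: weighted average of the image tokens.
        attended = [
            sum(weights[j] * image_tokens[j][c] for j in range(len(image_tokens)))
            for c in range(dim)
        ]
        # Residual connection keeps the original video feature.
        blended.append([q + a for q, a in zip(query, attended)])
    return blended
```

Because the attention weights are computed from the features themselves, each video token decides how much of each image token to absorb, which is one way a model can learn the fusion pattern rather than having it fixed by hand.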
### Experimental Verification
The paper validates the effectiveness and versatility of VLAB through extensive experiments on video-text retrieval, video captioning, and video question answering. The results show that VLAB significantly outperforms existing state-of-the-art methods on multiple benchmark datasets and sets new records in video question answering.
For example, on the MSRVTT, MSVD, and TGIF datasets, VLAB G achieves video question answering accuracies of 49.6%, 61.0%, and 79.0%, respectively, significantly better than competing methods.
### Summary
The paper's main contribution is VLAB, a new video-language pre-training method. Through feature adapting and feature blending, it transfers the representations of image-text models to video tasks, builds a unified multimodal model, and significantly improves performance on video-language understanding tasks.