Video Pre-trained Transformer: A Multimodal Mixture of Pre-trained Experts

Kastan Day, Daniel Christl, Rohan Salvi, Pranav Sriram
2023-03-25
Abstract: We present Video Pre-trained Transformer (VPT). VPT uses four SOTA encoder models from prior work to convert a video into a sequence of compact embeddings. Our backbone, based on a reference Flan-T5-11B architecture, learns a universal representation of the video that is a non-linear sum of the encoder models. It learns using an autoregressive causal language modeling loss by predicting the words spoken in YouTube videos. Finally, we evaluate on standard downstream benchmarks by training fully connected prediction heads for each task. To the best of our knowledge, this is the first use of multiple frozen SOTA models as encoders in an "embedding -> backbone -> prediction head" design pattern; all others have trained their own joint encoder models. Additionally, we include more modalities than the current SOTA, Merlot Reserve, by adding explicit Scene Graph information. For these two reasons, we believe it could combine the world's best open-source models to achieve SOTA performance. Initial experiments demonstrate the model is learning appropriately, but more experimentation and compute are necessary, and already in progress, to realize our loftier goals. Alongside this work, we build on the YT-20M dataset, reproducing it and adding 25,000 personally selected YouTube videos to its corpus. All code and model checkpoints are open sourced under a standard MIT license.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to use multimodal information effectively to improve video understanding. It introduces a model named Video Pre-trained Transformer (VPT), which builds a multimodal representation of a video from image sequences, raw audio, automatically generated subtitles, and scene graph information. VPT uses four state-of-the-art encoder models to convert a video into a sequence of compact embedding vectors. These vectors are fed into a backbone network based on the Flan-T5-11B architecture, which learns a general representation of the video as a non-linear combination of the encoder outputs. VPT is trained with an autoregressive causal language modeling loss by predicting the words spoken in YouTube videos. It is then evaluated on standard downstream benchmarks by training a fully connected prediction head for each task. The paper also adds explicit scene graph information, a modality that the current state-of-the-art model, Merlot Reserve, does not use. The authors believe that combining the best open-source models in this way could achieve state-of-the-art performance. Preliminary experiments show that the model is learning as expected, but more experiments and compute are needed to reach that goal. The research team has also reproduced the YT-20M dataset and extended it with 25,000 personally selected YouTube videos, and all code and model checkpoints are open-sourced under the MIT license.
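
The "embedding -> backbone" step described above can be illustrated with a minimal sketch. It is not the authors' implementation. The embedding widths, the shared backbone width, the random projection matrices, and the helper names (`make_projection`, `project`, `build_backbone_input`) are all assumptions introduced here for illustration; the text does not say how encoder outputs are aligned before they reach the backbone. The sketch only shows one plausible way that per-modality embeddings of different sizes could be mapped to a common width and concatenated into a single input sequence.

```python
import random

# Hypothetical embedding widths for the four modalities named in the text.
# These numbers are illustrative only and are not taken from the paper.
ENCODER_DIMS = {"frames": 8, "audio": 6, "captions": 4, "scene_graph": 5}
BACKBONE_DIM = 16  # hypothetical shared width expected by the backbone


def make_projection(in_dim, out_dim, rng):
    """Random (in_dim x out_dim) matrix standing in for a learned linear layer."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]


def project(vectors, weight):
    """Map each vector of width in_dim to width out_dim via a matrix product."""
    out_dim = len(weight[0])
    return [
        [sum(v[i] * weight[i][j] for i in range(len(v))) for j in range(out_dim)]
        for v in vectors
    ]


def build_backbone_input(encoder_outputs, projections):
    """Project every modality to BACKBONE_DIM, then concatenate along the sequence axis."""
    sequence = []
    for name in ENCODER_DIMS:  # fixed modality order
        sequence.extend(project(encoder_outputs[name], projections[name]))
    return sequence


rng = random.Random(0)
projections = {
    name: make_projection(dim, BACKBONE_DIM, rng) for name, dim in ENCODER_DIMS.items()
}

# Stand-ins for frozen encoder outputs: three embeddings per modality for one clip.
encoder_outputs = {
    name: [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(3)]
    for name, dim in ENCODER_DIMS.items()
}

backbone_input = build_backbone_input(encoder_outputs, projections)
print(len(backbone_input), len(backbone_input[0]))  # 12 16
```

In this toy setup the backbone would receive 12 vectors of width 16 (four modalities, three embeddings each). In a real system the encoders would stay frozen, and only the projections, the backbone, and the task-specific prediction heads would be trained.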