Video Pre-trained Transformer: A Multimodal Mixture of Pre-trained Experts

Kastan Day, Daniel Christl, Rohan Salvi, Pranav Sriram

2023-03-25

Abstract: We present Video Pre-trained Transformer (VPT). VPT uses four SOTA encoder models from prior work to convert a video into a sequence of compact embeddings. Our backbone, based on a reference Flan-T5-11B architecture, learns a universal representation of the video that is a non-linear sum of the encoder models. It learns using an autoregressive causal language modeling loss by predicting the words spoken in YouTube videos. Finally, we evaluate on standard downstream benchmarks by training fully connected prediction heads for each task. To the best of our knowledge, this is the first use of multiple frozen SOTA models as encoders in an "embedding -> backbone -> prediction head" design pattern - all others have trained their own joint encoder models. Additionally, we include more modalities than the current SOTA, Merlot Reserve, by adding explicit Scene Graph information. For these two reasons, we believe it could combine the world's best open-source models to achieve SOTA performance. Initial experiments demonstrate the model is learning appropriately, but more experimentation and compute are necessary, and already in progress, to realize our loftier goals. Alongside this work, we build on the YT-20M dataset, reproducing it and adding 25,000 personally selected YouTube videos to its corpus. All code and model checkpoints are open sourced under a standard MIT license.
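The autoregressive causal language modeling loss mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' training code: the function name `causal_lm_loss` and the toy logits are hypothetical, and the sketch assumes that the logits at step t were produced by a model that saw only the video embeddings and the words before step t.

```python
import math

def causal_lm_loss(step_logits, target_ids):
    """Mean negative log-likelihood of each target token under its step's logits.

    Assumption: step_logits[t] was computed from the video embeddings and the
    tokens before position t only, which is what makes the loss causal.
    """
    total = 0.0
    for logits, target in zip(step_logits, target_ids):
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # -log softmax(logits)[target]
        total += log_norm - logits[target]
    return total / len(target_ids)

# Toy setup: a vocabulary of 3 tokens and a transcript of 2 spoken words.
targets = [1, 2]
uniform_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
confident_logits = [[0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]

loss_uniform = causal_lm_loss(uniform_logits, targets)      # equals log(3)
loss_confident = causal_lm_loss(confident_logits, targets)  # smaller than log(3)
```

A model that spreads probability evenly over the vocabulary pays log(vocabulary size) per token, while one that concentrates probability on the word actually spoken pays less. That gap is the signal the backbone is trained on.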

Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

This paper addresses how to use multimodal information effectively to improve video understanding. It introduces a model named Video Pre-trained Transformer (VPT), which builds a multimodal representation of a video from image sequences, raw audio, automatically generated text subtitles, and scene graph information. VPT uses four state-of-the-art encoder models to convert a video into a sequence of compact embedding vectors. These embeddings are fed into a backbone based on the Flan-T5-11B architecture, which learns a general representation of the video as a non-linear combination of the encoder outputs. VPT is trained with an autoregressive causal language modeling loss to predict the words spoken in YouTube videos. It is then evaluated on standard downstream benchmarks, with a fully connected prediction head trained for each task. The paper also explores whether adding explicit scene graph information improves multimodal video models; this modality is not used by the current state-of-the-art model, Merlot Reserve. The authors believe that combining the world's best open-source models in this way could achieve state-of-the-art performance. Preliminary experiments show that the model is learning as expected, but more experiments and compute are needed to reach the paper's larger goals. The authors also reproduce the YT-20M dataset and extend it with 25,000 personally selected YouTube videos. All code and model checkpoints are open-sourced under the MIT license.
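The "embedding -> backbone -> prediction head" pattern described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: every name here (`make_encoder`, `embed_clip`, `backbone`, `prediction_head`, `EMBED_DIM`) is hypothetical, the encoders are random stand-ins for the four frozen experts, and the backbone is reduced to mean pooling where the real model uses a Flan-T5-based transformer.

```python
import random

EMBED_DIM = 4  # hypothetical embedding width shared by all encoders

def make_encoder(modality, n_tokens, seed):
    """Stand-in for one frozen expert: maps a clip to n_tokens embedding vectors."""
    rng = random.Random(seed)
    table = [[rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
             for _ in range(n_tokens)]

    def encode(clip):
        scale = float(len(clip[modality]))  # toy dependence on the input
        return [[scale * x for x in row] for row in table]

    return encode

# One stand-in encoder per modality named in the summary; weights never change.
ENCODERS = [
    make_encoder("frames", 3, seed=0),
    make_encoder("audio", 2, seed=1),
    make_encoder("captions", 2, seed=2),
    make_encoder("scene_graph", 1, seed=3),
]

def embed_clip(clip):
    """Concatenate every expert's embeddings into one input sequence."""
    sequence = []
    for encode in ENCODERS:
        sequence.extend(encode(clip))
    return sequence

def backbone(sequence):
    """Placeholder for the trainable backbone: mean-pool the sequence."""
    n = len(sequence)
    return [sum(vec[d] for vec in sequence) / n for d in range(EMBED_DIM)]

def prediction_head(features, weights, bias):
    """Fully connected head: one score per task label."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]

clip = {
    "frames": [0] * 16,
    "audio": [0.0] * 8,
    "captions": "hello world",
    "scene_graph": [("person", "holds", "cup")],
}
sequence = embed_clip(clip)    # 3 + 2 + 2 + 1 = 8 embedding vectors
features = backbone(sequence)  # one vector of width EMBED_DIM
head_weights = [[0.1] * EMBED_DIM, [-0.1] * EMBED_DIM]
scores = prediction_head(features, head_weights, [0.0, 0.0])
```

Because the experts are frozen, only the backbone and the per-task heads would need gradients, and adding another modality amounts to appending one more encoder to the list.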

TransVOS: Video Object Segmentation with Transformers

All in One: Exploring Unified Video-Language Pre-training

BEVT: BERT Pretraining of Video Transformers

Multiview Transformers for Video Recognition

Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

SimVTP: Simple Video Text Pre-training with Masked Autoencoders

TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale

SVT: Supertoken Video Transformer for Efficient Video Understanding

MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization

VideoGPT: Video Generation using VQ-VAE and Transformers

MED-VT++: Unifying Multimodal Learning with a Multiscale Encoder-Decoder Video Transformer

Token Shift Transformer for Video Classification

Streaming Video Model

PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data

TransVOD: End-to-End Video Object Detection With Spatial-Temporal Transformers

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

TP-VIT: A Two-Pathway Vision Transformer for Video Action Recognition

Transformer Video Classification algorithm based on video token-to-token.

TAM-VT: Transformation-Aware Multi-scale Video Transformer for Segmentation and Tracking