Abstract: In this paper, we show that useful video representations can be learned from synthetic videos and natural images, without incorporating natural videos in the training. We propose a progression of video datasets synthesized by simple generative processes that model a growing set of natural video properties (e.g. motion, acceleration, and shape transformations). The downstream performance of video models pre-trained on these generated datasets gradually increases with the dataset progression. A VideoMAE model pre-trained on our synthetic videos closes 97.2% of the performance gap on UCF101 action classification between training from scratch and self-supervised pre-training from natural videos, and outperforms the pre-trained model on HMDB51. Introducing crops of static images to the pre-training stage results in similar performance to UCF101 pre-training and outperforms the UCF101 pre-trained model on 11 out of 14 out-of-distribution datasets of UCF101-P. Analyzing the low-level properties of the datasets, we identify correlations between frame diversity, frame similarity to natural data, and downstream performance. Our approach provides a more controllable and transparent alternative to video data curation processes for pre-training.
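
The "performance gap closed" figure in the abstract is a relative measure. The following is a minimal sketch of how such a figure can be computed, assuming it is the share of the accuracy gap between a from-scratch baseline and a natural-video pre-trained model that the synthetic model recovers. The function name `gap_closed` and the accuracy values are hypothetical and are not numbers from the paper.

```python
def gap_closed(scratch_acc: float, natural_acc: float, synthetic_acc: float) -> float:
    """Percentage of the scratch-to-natural accuracy gap recovered by the synthetic model."""
    gap = natural_acc - scratch_acc
    if gap <= 0:
        raise ValueError("natural_acc must exceed scratch_acc")
    return 100.0 * (synthetic_acc - scratch_acc) / gap


# Hypothetical accuracies (percent), chosen only to illustrate the arithmetic.
print(gap_closed(scratch_acc=50.0, natural_acc=90.0, synthetic_acc=80.0))  # 75.0
```

Under this definition, a value above 100 would mean the synthetically pre-trained model outperforms the natural-video pre-trained one, as the abstract reports for HMDB51.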

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper explores how to learn useful video representations from synthetic videos and static images, without relying on natural videos. Specifically, the authors propose a progression of video datasets, synthesized by simple generative processes, that model a growing set of natural video properties (e.g., motion, acceleration, and shape transformations). Downstream performance of models pre-trained on these datasets increases along the progression and approaches that of models pre-trained on natural videos.

### Main Contributions

1. **Generated Video Datasets**: The authors designed a series of synthetic video generators of increasing complexity, from static circles to dynamic shapes, and then to accelerated transformations of textures and image crops.
2. **Pre-training Effectiveness**: A VideoMAE model pre-trained on the synthetic videos closes 97.2% of the UCF101 action classification performance gap between training from scratch and self-supervised pre-training on natural videos.
3. **Generalization Ability**: With crops of static images added to the pre-training stage, the model outperforms the UCF101 pre-trained model on 11 of the 14 out-of-distribution datasets of UCF101-P.
4. **Low-level Attribute Analysis**: By analyzing the low-level attributes of the generated datasets, the authors identified correlations between frame diversity, frame similarity to natural data, and downstream task performance.

### Experimental Results

- **UCF101 Action Classification**: The final model (accelerated transformation shapes and ImageNet crops) performed strongly at both ViT-B and ViT-L scales, even surpassing models pre-trained on natural videos.
- **HMDB51 Action Classification**: On the HMDB51 dataset, models pre-trained with generated data also performed well, with the last two models in the progression outperforming those pre-trained on natural videos.
- **Kinetics-400 Action Classification**: Despite the larger dataset and limited computational resources, models pre-trained with generated data still achieved competitive results, closing 86.5% of the gap between supervised training and self-supervised pre-training.
- **Out-of-distribution Generalization**: On the UCF101-P dataset, models pre-trained with generated data outperformed those pre-trained on natural videos on 11 of the 14 sub-datasets, demonstrating stronger generalization.

### Conclusion

This paper demonstrates that video models can be pre-trained effectively on synthetic videos and static images: the resulting models perform well on standard benchmarks and generalize better on out-of-distribution datasets. The approach offers a more controllable and transparent alternative to video data curation for large-scale self-supervised video learning.
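
The dataset progression described above (static circles, then moving shapes, then accelerating ones) can be illustrated with a toy generator. This is only a sketch under stated assumptions, not the authors' generator: the names `render_circle` and `make_clip`, the `level` parameter, the frame size, and the velocity and acceleration ranges are all hypothetical choices made for illustration.

```python
import random


def render_circle(size, cx, cy, radius):
    """Return a size x size binary frame (list of rows) with a filled circle."""
    return [
        [1 if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2 else 0 for x in range(size)]
        for y in range(size)
    ]


def make_clip(level, num_frames=8, size=32, radius=4, seed=0):
    """Generate a clip of one circle.

    level 0: static circle
    level 1: circle moving at constant velocity
    level 2: circle moving with constant acceleration
    (Hypothetical levels, loosely mirroring the progression in the summary.)
    """
    rng = random.Random(seed)
    # Start fully inside the frame.
    cx = rng.uniform(radius, size - radius)
    cy = rng.uniform(radius, size - radius)
    vx = rng.uniform(-2, 2) if level >= 1 else 0.0
    vy = rng.uniform(-2, 2) if level >= 1 else 0.0
    ax = rng.uniform(-0.5, 0.5) if level >= 2 else 0.0
    ay = rng.uniform(-0.5, 0.5) if level >= 2 else 0.0

    frames = []
    for _ in range(num_frames):
        frames.append(render_circle(size, cx, cy, radius))
        cx, cy = cx + vx, cy + vy  # integrate position
        vx, vy = vx + ax, vy + ay  # integrate velocity
    return frames


static_clip = make_clip(level=0)
moving_clip = make_clip(level=1)
accelerating_clip = make_clip(level=2)
```

Each level adds one property on top of the previous one, which is the sense in which a dataset progression "models a growing set of natural video properties"; a real pipeline would also vary shapes, textures, and image crops.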

Learning Video Representations without Natural Videos

Ivs-Net: Learning Human View Synthesis from Internet Videos

Knowledge-guided Pre-Training and Fine-Tuning: Video Representation Learning for Action Recognition

Data Collection-free Masked Video Modeling

An Evaluation of Large Pre-Trained Models for Gesture Recognition using Synthetic Videos

Adaptive Hierarchical Motion-Focused Model for Video Prediction

Tubelet-Contrastive Self-Supervision for Video-Efficient Generalization

Masked Motion Encoding for Self-Supervised Video Representation Learning

Temporally-Embedded Self-Supervised Video Representation Learning

Visual Data Synthesis Via GAN for Zero-Shot Video Classification

Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion

Generative Models as a Data Source for Multiview Representation Learning

Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics

DistInit: Learning Video Representations Without a Single Labeled Video

Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning

ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning

Self-Supervised Video Representation Learning with Motion-Contrastive Perception

Training Robust Deep Physiological Measurement Models with Synthetic Video-based Data

Distinguish Any Fake Videos: Unleashing the Power of Large-scale Data and Motion Features

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Video Instruction Tuning With Synthetic Data