Data Collection-free Masked Video Modeling

Yuchi Ishikawa, Masayoshi Kondo, Yoshimitsu Aoki

2024-09-11

Abstract: Pre-training video transformers generally requires a large amount of data, presenting significant challenges in terms of data collection costs and concerns related to privacy, licensing, and inherent biases. Synthesizing data is one of the promising ways to solve these issues, yet pre-training solely on synthetic data has its own challenges. In this paper, we introduce an effective self-supervised learning framework for videos that leverages readily available and less costly static images. Specifically, we define the Pseudo Motion Generator (PMG) module that recursively applies image transformations to generate pseudo-motion videos from images. These pseudo-motion videos are then leveraged in masked video modeling. Our approach is applicable to synthetic images as well, thus entirely freeing video pre-training from data collection costs and other concerns in real data. Through experiments in action recognition tasks, we demonstrate that this framework allows effective learning of spatio-temporal features through pseudo-motion videos, significantly improving over existing methods which also use static images and partially outperforming those using both real and synthetic videos. These results uncover fragments of what video transformers learn through masked video modeling.
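The recursive idea behind the PMG can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the function names `generate_pseudo_motion` and `make_shift` are hypothetical, and a wrap-around pixel shift stands in for whatever image transformations the paper actually uses. The only point it shows is that each frame is produced by transforming the previous frame, so a single static image yields a clip with consistent apparent motion.

```python
import numpy as np

def make_shift(dy, dx):
    """Return a toy transformation: a wrap-around shift by (dy, dx) pixels.

    Assumption: a stand-in for the paper's image transformations.
    """
    def shift(frame):
        return np.roll(frame, shift=(dy, dx), axis=(0, 1))
    return shift

def generate_pseudo_motion(image, transform, num_frames=16):
    """Build a (T, H, W, C) clip from one (H, W, C) image.

    The transformation is applied recursively: frame t+1 is the
    transform of frame t, not of the original image.
    """
    frames = [image]
    for _ in range(num_frames - 1):
        frames.append(transform(frames[-1]))
    return np.stack(frames, axis=0)

# Usage: a small synthetic image drifting 1 px down and 2 px right per frame.
image = np.arange(8 * 8 * 3).reshape(8, 8, 3)
video = generate_pseudo_motion(image, make_shift(1, 2), num_frames=4)
```

Because the transformation is passed in as a callable, the same loop would work with zooms, rotations, or compositions of several transformations; which ones are useful for pre-training is a question the paper's experiments address, not this sketch.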

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses four problems that arise when collecting video data for pre-training: high collection cost, privacy, copyright and licensing, and data bias. Specifically:

1. **High cost of data collection**: Compared to audio, text, and images, video data is larger in volume, which makes downloading, storing, and preprocessing it very expensive.
2. **Privacy issues**: Video data often contains personally identifiable information (such as faces), which raises significant privacy concerns.
3. **Copyright and licensing issues**: Video data may be collected without permission, infringing copyright or violating license terms. For example, some datasets are collected from video-sharing websites like YouTube, where videos are protected by the Standard YouTube License by default, which prohibits downloading content.
4. **Data bias**: Large-scale datasets may inadvertently contain biases related to nationality, gender, age, etc., which affect the fairness and inclusivity of models.

To address these issues, the authors propose a self-supervised learning framework that generates pseudo-motion videos from static images, thereby avoiding the cost and related issues of video data collection. Through experiments, the authors demonstrate that this method can effectively learn spatio-temporal features and achieve significant performance improvements in action recognition tasks.

Motion Guided Token Compression for Efficient Masked Video Modeling

Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Masked Motion Predictors Are Strong 3D Action Representation Learners

Masked Feature Prediction for Self-Supervised Visual Pre-Training

MaskViT: Masked Visual Pre-Training for Video Prediction

Masked Motion Encoding for Self-Supervised Video Representation Learning

Learning Video Representations without Natural Videos

Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning

Social-MAE: Social Masked Autoencoder for Multi-person Motion Representation Learning

Unmasked Teacher: Towards Training-Efficient Video Foundation Models

Concatenated Masked Autoencoders as Spatial-Temporal Learner

VideoMAC: Video Masked Autoencoders Meet ConvNets

Data-efficient Event Camera Pre-training via Disentangled Masked Modeling

Real-World Robot Learning with Masked Visual Pre-training

SimVTP: Simple Video Text Pre-training with Masked Autoencoders

MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers

MGMAE: Motion Guided Masking for Video Masked Autoencoding

MV2MAE: Multi-View Video Masked Autoencoders

VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking