Masked Autoencoders As Spatiotemporal Learners

Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, Kaiming He

2022-10-21

Abstract: This paper studies a conceptually simple extension of Masked Autoencoders (MAE) to spatiotemporal representation learning from videos. We randomly mask out spacetime patches in videos and learn an autoencoder to reconstruct them in pixels. Interestingly, we show that our MAE method can learn strong representations with almost no inductive bias on spacetime (only except for patch and positional embeddings), and spacetime-agnostic random masking performs the best. We observe that the optimal masking ratio is as high as 90% (vs. 75% on images), supporting the hypothesis that this ratio is related to information redundancy of the data. A high masking ratio leads to a large speedup, e.g., > 4x in wall-clock time or even more. We report competitive results on several challenging video datasets using vanilla Vision Transformers. We observe that MAE can outperform supervised pre-training by large margins. We further report encouraging results of training on real-world, uncurated Instagram data. Our study suggests that the general framework of masked autoencoding (BERT, MAE, etc.) can be a unified methodology for representation learning with minimal domain knowledge.

Computer Vision and Pattern Recognition, Machine Learning

What problem does this paper attempt to address?

This paper addresses spatiotemporal representation learning from video. The authors extend Masked Autoencoders (MAE) to video by randomly masking spacetime patches and training an autoencoder to reconstruct the masked patches in pixel space. The method aims to learn strong representations with minimal domain knowledge, introducing almost no spacetime-specific inductive bias beyond the patch and positional embeddings. The key points of the paper are:

1. **Spatiotemporal masking**: Spacetime patches are masked at random, with an optimal masking ratio as high as 90%. This is higher than the 75% used on images, which supports the hypothesis that the optimal ratio is related to the information redundancy of the data; video is more redundant than still images.
2. **Autoencoder design**: Vanilla Vision Transformers serve as both encoder and decoder. The encoder processes only the visible spacetime patches, while the decoder processes the encoded patches together with mask tokens to reconstruct the original input.
3. **Efficiency**: Because the encoder sees only the visible patches, a high masking ratio substantially reduces its time and memory cost and speeds up training. For example, a 90% masking ratio reduces the encoder's computation to less than 1/10 of the unmasked cost, and the abstract reports a wall-clock speedup of more than 4x.
4. **Experimental results**: Experiments on several video datasets show that MAE pre-training yields competitive results and can outperform supervised pre-training by large margins. The authors also report encouraging results from training on real-world, uncurated Instagram data.

Overall, the paper explores how to apply Masked Autoencoders to video for efficient spatiotemporal representation learning, and it suggests that masked autoencoding (BERT, MAE, etc.) can serve as a unified methodology for representation learning with minimal domain knowledge.
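The masking step and the efficiency argument summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `random_spacetime_mask`, the seed handling, and the 8 x 14 x 14 token grid are assumptions chosen for illustration, and the cost figures are simple token counts rather than measured speedups.

```python
import random


def random_spacetime_mask(num_tokens, mask_ratio, seed=0):
    """Spacetime-agnostic random masking over a flat list of patch tokens.

    Hypothetical helper: returns (visible, masked) index lists. The indices
    are sampled uniformly, with no regard to which frame or spatial location
    a token came from.
    """
    num_keep = int(num_tokens * (1.0 - mask_ratio))  # tokens the encoder sees
    order = list(range(num_tokens))
    random.Random(seed).shuffle(order)
    visible = sorted(order[:num_keep])
    masked = sorted(order[num_keep:])
    return visible, masked


# Assumed token grid: 8 temporal positions x 14 x 14 spatial positions.
T, H, W = 8, 14, 14
num_tokens = T * H * W

visible, masked = random_spacetime_mask(num_tokens, mask_ratio=0.9)

# Fraction of tokens the encoder processes (drives the per-token layer cost).
token_fraction = len(visible) / num_tokens
# Fraction of token pairs in self-attention (cost quadratic in token count).
attention_fraction = token_fraction ** 2

print(f"total tokens:       {num_tokens}")
print(f"visible tokens:     {len(visible)}")
print(f"masked tokens:      {len(masked)}")
print(f"token fraction:     {token_fraction:.4f}")
print(f"attention fraction: {attention_fraction:.4f}")
```

Under these assumptions the encoder processes just under 10% of the tokens, which is consistent with the summary's statement that encoder computation falls below 1/10 at a 90% masking ratio. The decoder still processes the full token set (encoded patches plus mask tokens), so the end-to-end speedup is smaller than the encoder-only reduction.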

Masked Autoencoders for Point Cloud Self-supervised Learning

Concatenated Masked Autoencoders as Spatial-Temporal Learner

AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders

Masked Autoencoders Are Scalable Vision Learners

Self-supervised Video Representation Learning with Motion-Aware Masked Autoencoders

Improving Masked Autoencoders by Learning Where to Mask

Ti-MAE: Self-Supervised Masked Time Series Autoencoders

TimeMAE: Self-Supervised Representations of Time Series with Decoupled Masked Autoencoders

TS-MAE: A masked autoencoder for time series representation learning

Understanding Masked Autoencoders From a Local Contrastive Perspective

Spatio-Temporal Encoding of Brain Dynamics with Surface Masked Autoencoders

GD-MAE: Generative Decoder for MAE Pre-training on LiDAR Point Clouds

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Graph Masked Autoencoder for Spatio-Temporal Graph Learning

How Mask Matters: Towards Theoretical Understandings of Masked Autoencoders

Masked Autoencoders in 3D Point Cloud Representation Learning

Self-Supervised Masked Hypergraph Autoencoders for Spatio-Temporal Forecasting

MGMAE: Motion Guided Masking for Video Masked Autoencoding

VideoMAC: Video Masked Autoencoders Meet ConvNets