Masked Autoencoders As Spatiotemporal Learners

Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, Kaiming He
2022-10-21
Abstract: This paper studies a conceptually simple extension of Masked Autoencoders (MAE) to spatiotemporal representation learning from videos. We randomly mask out spacetime patches in videos and learn an autoencoder to reconstruct them in pixels. Interestingly, we show that our MAE method can learn strong representations with almost no inductive bias on spacetime (only except for patch and positional embeddings), and spacetime-agnostic random masking performs the best. We observe that the optimal masking ratio is as high as 90% (vs. 75% on images), supporting the hypothesis that this ratio is related to information redundancy of the data. A high masking ratio leads to a large speedup, e.g., > 4x in wall-clock time or even more. We report competitive results on several challenging video datasets using vanilla Vision Transformers. We observe that MAE can outperform supervised pre-training by large margins. We further report encouraging results of training on real-world, uncurated Instagram data. Our study suggests that the general framework of masked autoencoding (BERT, MAE, etc.) can be a unified methodology for representation learning with minimal domain knowledge.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper addresses spatiotemporal representation learning from video. Specifically, the authors extend Masked Autoencoders (MAE) to video by randomly masking spacetime patches and training an autoencoder to reconstruct the masked patches in pixels. The method aims to learn strong representations with minimal domain knowledge, introducing almost no spacetime-specific inductive bias beyond the patch and positional embeddings. The key points of the paper include:

1. **Spatiotemporal masking**: Spacetime patches are masked at random, and the optimal masking ratio is as high as 90%. This is higher than the 75% used on images, which supports the hypothesis that the optimal ratio is related to the information redundancy of the data.
2. **Autoencoder design**: Vanilla Vision Transformers serve as the encoder and decoder. The encoder processes only the visible spacetime patches, while the decoder processes the encoded patches together with the mask tokens to reconstruct the original input.
3. **Efficiency**: A high masking ratio substantially reduces the time and memory cost of the encoder, which speeds up training. For example, at a 90% masking ratio the encoder sees only about one tenth of the patches, and the abstract reports a wall-clock speedup of more than 4x.
4. **Experimental results**: Experiments on several challenging video datasets show that MAE pre-training yields competitive results and can outperform supervised pre-training by large margins. The authors also report encouraging results from training on real-world, uncurated Instagram data.

Overall, the main goal of the paper is to explore how Masked Autoencoders can be applied to video for efficient spatiotemporal representation learning, and to show that masked autoencoding can serve as a unified methodology for representation learning with minimal domain knowledge.
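As a rough illustration of the masking and efficiency points above, the following NumPy sketch draws a spacetime-agnostic random mask over a flattened grid of patch tokens and keeps only the visible ones for the encoder. It is not the authors' implementation. The function name `random_spacetime_mask`, the clip size (16 frames of 224x224 pixels), the patch size (2x16x16), and the 8-dimensional stand-in embeddings are all assumptions chosen for illustration.

```python
import numpy as np


def random_spacetime_mask(num_tokens, mask_ratio, rng):
    """Pick a random subset of patch tokens to keep visible.

    Hypothetical helper: returns (ids_keep, mask), where mask[i] is True
    if token i is masked out and False if it stays visible.
    """
    num_keep = int(num_tokens * (1.0 - mask_ratio))
    # A single permutation over the flattened (time, height, width) grid
    # treats all tokens alike, so the mask ignores spacetime structure.
    perm = rng.permutation(num_tokens)
    ids_keep = np.sort(perm[:num_keep])
    mask = np.ones(num_tokens, dtype=bool)
    mask[ids_keep] = False
    return ids_keep, mask


# Assumed clip and patch sizes, for illustration only.
frames, height, width = 16, 224, 224
pt, ph, pw = 2, 16, 16
num_tokens = (frames // pt) * (height // ph) * (width // pw)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((num_tokens, 8))  # stand-in patch embeddings
ids_keep, mask = random_spacetime_mask(num_tokens, 0.9, rng)
visible = tokens[ids_keep]  # only these tokens would reach the encoder

print(num_tokens, visible.shape[0], int(mask.sum()))
```

Under these assumed sizes the grid has 1568 tokens, of which 156 stay visible and 1412 are masked. Because the encoder's cost grows at least linearly with the number of tokens it processes, feeding it roughly 10% of the tokens is consistent with the large encoder savings described in point 3.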