Abstract: Several recent works have directly extended the image masked autoencoder (MAE) with random masking into the video domain, achieving promising results. However, unlike images, both spatial and temporal information are important for video understanding. This suggests that the random masking strategy that is inherited from the image MAE is less effective for video MAE. This motivates the design of a novel masking algorithm that can more efficiently make use of video saliency. Specifically, we propose a motion-guided masking algorithm (MGM) which leverages motion vectors to guide the position of each mask over time. Crucially, these motion-based correspondences can be directly obtained from information stored in the compressed format of the video, which makes our method efficient and scalable. On two challenging large-scale video benchmarks (Kinetics-400 and Something-Something V2), we equip video MAE with our MGM and achieve up to +$1.3\%$ improvement compared to previous state-of-the-art methods. Additionally, our MGM achieves equivalent performance to previous video MAE using up to $66\%$ fewer training epochs. Lastly, we show that MGM generalizes better to downstream transfer learning and domain adaptation tasks on the UCF101, HMDB51, and Diving48 datasets, achieving up to +$4.9\%$ improvement compared to baseline methods.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper "Motion-Guided Masking for Spatiotemporal Representation Learning" aims to solve two main problems in video self-supervised learning:

1. **Deficiencies of random masking strategies**:
   - Existing video self-supervised learning methods typically extend the image masked autoencoder (MAE) directly to video, applying its random masking strategy unchanged. However, unlike static images, videos contain important spatiotemporal information. The random masking strategy inherited from image MAE is therefore less effective for video, because it cannot make good use of the salient regions of a video.
   - Specifically, random masking in video leaves redundant information visible, which makes the reconstruction task too easy, so a higher masking ratio is required to increase the task difficulty. However, a high masking ratio reduces the number of visible patches, limiting the model's ability to learn spatiotemporal saliency.
2. **Improving the efficiency and performance of video self-supervised learning**:
   - The authors propose a new masking algorithm, Motion-Guided Masking (MGM), which uses the motion vectors in a video to guide the position of the mask. In this way, MGM makes more efficient use of the salient regions of the video, improving both the learning efficiency and the performance of the model.
   - Compared with existing video MAE methods, MGM not only achieves better performance on large-scale video benchmarks (Kinetics-400 and Something-Something V2), but is also more efficient to train, reaching the same or better performance in fewer training epochs.

### Main contributions

1. **Proposing an efficient self-supervised algorithm, MGM**:
   - MGM produces a 3D mask over the video by continuously modeling motion trajectories.
   - It uses the motion vectors already stored by the H.264 codec to provide motion guidance at no additional computational cost.
2. **New state-of-the-art or comparable results on large-scale datasets**:
   - On the two large-scale datasets, Kinetics-400 and Something-Something V2, MGM achieves new best or comparable results.
   - On the three small-scale datasets, UCF101, HMDB51 and Diving48, MGM performs well in full fine-tuning, linear-probe evaluation and domain-adaptation settings, with a performance improvement of up to 4.9%.
3. **Detailed ablation experiments and insights**:
   - The authors verify the effectiveness of MGM through detailed experiments and provide in-depth analysis of the model's performance.

### Method overview

1. **Preliminary concepts**:
   - Motion in videos is concentrated in salient regions. Because motion is unevenly distributed throughout the video, random masking cannot reliably cover these regions.
   - To help the model learn spatiotemporal semantics more effectively, MGM generates a continuously moving 3D mask that stays contiguous in both time and space.
2. **Simulated Motion Masking (SMM)**:
   - A rectangular mask is initialized at random and propagated over time with a recurrence relation, producing a dense, moving 3D mask.
   - However, the mask generated by SMM is not context-aware, so it may not reflect the real motion in the video.
3. **Motion-Guided Masking (MGM)**:
   - The motion vectors in the H.264 codec are used to ensure that the moving mask always covers the motion between frames.
   - The motion vector map is upsampled to match the spatial resolution of the video, and the position and size of the mask are then adjusted according to the magnitude of the motion vectors.

### Experimental results

1. **Performance on large-scale datasets**:
   - On the Something-Something V2 and Kinetics-400 datasets, MGM improves by 1.3% and 0.2% respectively compared to the previous best methods.
   - Compared with VideoMAE, MGM requires 50% fewer training epochs for the same performance.
2. **Performance on small-scale datasets**:
   - On UCF101, HMDB51 and Diving48
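The two masking schemes described in the method overview can be illustrated with a short NumPy sketch. This is not the authors' implementation: the function names, the random-walk recurrence used for SMM, the nearest-neighbour upsampling, and the fixed box size are all assumptions made for illustration (the summary says MGM also adapts the mask size, which this sketch omits). The motion vectors are assumed to arrive as a coarse array of shape `(T, h, w, 2)`, however they were extracted from the compressed video.

```python
import numpy as np


def simulated_motion_mask(num_frames, height, width, box_h, box_w, max_step=4, seed=0):
    """SMM-style sketch: a random rectangle propagated over time by a recurrence.

    Assumption: the recurrence is a clipped random walk on the top-left corner.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((num_frames, height, width), dtype=bool)
    # Random initial position, chosen so the box lies fully inside the frame.
    y = int(rng.integers(0, height - box_h + 1))
    x = int(rng.integers(0, width - box_w + 1))
    for t in range(num_frames):
        mask[t, y:y + box_h, x:x + box_w] = True
        # Next position = current position + small random step, kept in bounds.
        dy, dx = rng.integers(-max_step, max_step + 1, size=2)
        y = int(np.clip(y + dy, 0, height - box_h))
        x = int(np.clip(x + dx, 0, width - box_w))
    return mask


def motion_guided_mask(motion_vectors, frame_hw, box_hw):
    """MGM-style sketch: place a fixed-size box over the strongest motion per frame.

    motion_vectors: hypothetical array of shape (T, h, w, 2) with per-block (dy, dx).
    Returns a boolean mask of shape (T, H, W).
    """
    num_frames = motion_vectors.shape[0]
    H, W = frame_hw
    box_h, box_w = box_hw
    mask = np.zeros((num_frames, H, W), dtype=bool)
    for t in range(num_frames):
        magnitude = np.linalg.norm(motion_vectors[t], axis=-1)  # shape (h, w)
        # Nearest-neighbour upsampling of the coarse map to frame resolution.
        rows = np.arange(H) * magnitude.shape[0] // H
        cols = np.arange(W) * magnitude.shape[1] // W
        magnitude_up = magnitude[rows][:, cols]  # shape (H, W)
        # Centre the box on the pixel with the largest motion magnitude.
        # If a frame has no motion at all, argmax falls back to the top-left corner.
        cy, cx = np.unravel_index(np.argmax(magnitude_up), magnitude_up.shape)
        y = int(np.clip(cy - box_h // 2, 0, H - box_h))
        x = int(np.clip(cx - box_w // 2, 0, W - box_w))
        mask[t, y:y + box_h, x:x + box_w] = True
    return mask
```

In both functions the masked region is contiguous within each frame and persists across frames, which is the property the summary contrasts with independent random patch masking. A tokenizer would then mark a patch as masked whenever the box overlaps it; that step is left out here.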

Motion-Guided Masking for Spatiotemporal Representation Learning

MGMAE: Motion Guided Masking for Video Masked Autoencoding

Text-Guided Video Masked Autoencoder

Motion Guided Token Compression for Efficient Masked Video Modeling

Self-supervised Video Representation Learning with Motion-Aware Masked Autoencoders

AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders

Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos

VideoMAC: Video Masked Autoencoders Meet ConvNets

SurgMAE: Masked Autoencoders for Long Surgical Video Analysis

GD-MAE: Generative Decoder for MAE Pre-training on LiDAR Point Clouds

Concatenated Masked Autoencoders as Spatial-Temporal Learner

Masked Motion Encoding for Self-Supervised Video Representation Learning

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Extending Video Masked Autoencoders to 128 frames

VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

MAR: Masked Autoencoders for Efficient Action Recognition

Masked Autoencoders As Spatiotemporal Learners

EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens

MV2MAE: Multi-View Video Masked Autoencoders