Abstract: Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused evaluations on semantic-related tasks – action classification, ImageNet classification, etc. In this paper we focus on evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation. We show that by learning from very large video datasets, masked auto-encoding (MAE) with transformer video models actually scales, consistently improving performance on these 4D tasks, as model size increases from 20M all the way to the largest by far reported self-supervised video model – 22B parameters. Rigorous apples-to-apples comparison with many recent image and video models demonstrates the benefits of scaling 4D representations.

What problem does this paper attempt to address?

The core problem this paper addresses is whether pure self-supervised learning on video data scales, particularly on non-semantic vision tasks. Specifically, the authors focus on tasks that require strong spatial (3D) and temporal (+1D = 4D) understanding, such as camera pose estimation, point and object tracking, and depth estimation. By training masked auto-encoding (MAE) with transformer video models on very large video datasets, they aim to show that performance keeps improving as model size grows from 20M to 22B parameters.

### Summary of the Main Problems in the Paper

1. **Effectiveness of Self-supervised Learning**:
   - Can self-supervised learning be improved by scaling up the model size and the amount of data?
   - Can self-supervised learning perform strongly on complex 4D tasks without language supervision?
2. **Model Comparison**:
   - Compare the performance of different types of pre-trained models (including image models and video models) on these 4D tasks.
   - Explore the differences between language supervision and video self-supervision, and evaluate which is better suited to these tasks.
3. **Model Scaling**:
   - Study the scaling behavior of MAE on large-scale video datasets, especially as the number of model parameters grows from 20M to 22B.
   - Propose a new decoding scheme to improve the training efficiency of large models.

### Key Findings

- **Advantages of Self-supervised Learning**: Self-supervised models (such as MAE) trained on large-scale video datasets perform well on 4D tasks, and performance improves significantly as model size increases.
- **Language Supervision vs. Video Self-supervision**: Language supervision may perform better on some classification tasks, but on tasks involving spatio-temporal understanding, video self-supervision shows stronger capabilities.
- **Model Scaling Effect**: When the MAE model is scaled to 22B parameters, performance still keeps improving, contradicting the earlier view that MAE scales poorly.

### Formula Presentation

To show the technical details of the research more clearly, the following are some key formulas mentioned in the paper:

- **Masked Auto-encoding (MAE) Loss Function**: \[ \mathcal{L}_{\text{MAE}}=\frac{1}{N} \sum_{i = 1}^{N}\|x_i-\hat{x}_i\|^2 \] where $x_i$ is the $i$-th masked patch of the original input video, $\hat{x}_i$ is its reconstruction, and $N$ is the number of masked patches.
- **Positional Embedding**: \[ X_{\text{pos}} = X + P \] where $X$ is the input feature and $P$ is the positional embedding matrix.

Through these formulas and experimental results, the paper shows the potential of large-scale self-supervised learning for 4D vision tasks and provides an important reference for future research.
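The two formulas above can be made concrete with a small sketch. This is not the paper's implementation: the function names (`random_mask`, `add_positional_embedding`, `mae_loss`), the use of plain Python lists for patch tokens, and the uniform random masking strategy are all assumptions chosen for illustration. It only shows how a loss averaged over the $N$ masked patches and an additive positional embedding might be computed.

```python
import random


def random_mask(num_patches, mask_ratio, rng):
    """Pick which patch indices to hide (hypothetical uniform random masking)."""
    num_masked = round(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), num_masked))


def add_positional_embedding(tokens, pos):
    """X_pos = X + P, applied elementwise to every patch token."""
    return [[x + p for x, p in zip(tok, pe)] for tok, pe in zip(tokens, pos)]


def mae_loss(patches, reconstructions, masked_idx):
    """Mean over the masked patches of the squared L2 reconstruction error."""
    if not masked_idx:
        raise ValueError("at least one patch must be masked")
    total = 0.0
    for i in masked_idx:
        # ||x_i - x_hat_i||^2 for one masked patch
        total += sum((x - y) ** 2 for x, y in zip(patches[i], reconstructions[i]))
    return total / len(masked_idx)


# Toy example: 4 patches with 2 values each; patches 1 and 3 are masked.
patches = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
recon = [[0.0, 1.0], [2.0, 2.0], [4.0, 5.0], [5.0, 7.0]]
loss = mae_loss(patches, recon, masked_idx=[1, 3])
print(loss)
```

Only the masked patches contribute to the loss, so the visible patches 0 and 2 are ignored even if their reconstructions were poor. In a real model the patches would be tensors and the reconstructions would come from a decoder; those parts are omitted here.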

Scaling 4D Representations
