Scaling 4D Representations

João Carreira, Dilara Gokay, Michael King, Chuhan Zhang, Ignacio Rocco, Aravindh Mahendran, Thomas Albert Keck, Joseph Heyward, Skanda Koppula, Etienne Pot, Goker Erdogan, Yana Hasson, Yi Yang, Klaus Greff, Guillaume Le Moing, Sjoerd van Steenkiste, Daniel Zoran, Drew A. Hudson, Pedro Vélez, Luisa Polanía, Luke Friedman, Chris Duvarney, Ross Goroshin, Kelsey Allen, Jacob Walker, Rishabh Kabra, Eric Aboussouan, Jennifer Sun, Thomas Kipf, Carl Doersch, Viorica Pătrăucean, Dima Damen, Pauline Luc, Mehdi S. M. Sajjadi, Andrew Zisserman
2024-12-20
Abstract: Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused evaluations on semantic-related tasks – action classification, ImageNet classification, etc. In this paper we focus on evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation. We show that by learning from very large video datasets, masked auto-encoding (MAE) with transformer video models actually scales, consistently improving performance on these 4D tasks, as model size increases from 20M all the way to the largest by far reported self-supervised video model – 22B parameters. Rigorous apples-to-apples comparison with many recent image and video models demonstrates the benefits of scaling 4D representations.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper asks whether pure self-supervised learning from video scales, and in particular how it performs on non-semantic vision tasks. The authors focus on tasks that require strong spatial (3D) and temporal (+1D = 4D) understanding, such as camera pose estimation, point and object tracking, and depth estimation. By training transformer video models with masked auto-encoding (MAE) on very large video datasets, they aim to show that performance keeps improving as model size grows from 20M to 22B parameters.

### Summary of the Main Problems in the Paper

1. **Effectiveness of Self-supervised Learning**:
   - Can self-supervised learning be improved by increasing model size and the amount of data?
   - Can self-supervised learning perform well on complex 4D tasks without language supervision?
2. **Model Comparison**:
   - Compare different types of pre-trained models (both image models and video models) on these 4D tasks.
   - Explore the differences between language supervision and video self-supervision, and evaluate which is better suited to these tasks.
3. **Model Scaling**:
   - Study how MAE scales on large video datasets, especially as the number of model parameters grows from 20M to 22B.
   - Propose a new decoding scheme to improve the training efficiency of large models.

### Key Findings

- **Advantages of Self-supervised Learning**: Self-supervised models (such as MAE) trained on large video datasets perform well on 4D tasks, and performance improves consistently as model size increases.
- **Language Supervision vs. Video Self-supervision**: Language supervision may perform better on some classification tasks, but video self-supervision shows stronger capabilities on tasks involving spatio-temporal understanding.
- **Model Scaling Effect**: MAE continues to improve when scaled up to 22B parameters, contradicting the earlier view that MAE scales poorly.

### Formula Presentation

The key formulas underlying the approach are:

- **Masked Auto-encoding (MAE) Loss Function**:
  \[ \mathcal{L}_{\text{MAE}} = \frac{1}{N} \sum_{i=1}^{N} \|x_i - \hat{x}_i\|^2 \]
  where \(x_i\) is the original content of the \(i\)-th masked patch of the input video, \(\hat{x}_i\) is its reconstruction, and \(N\) is the number of masked patches.
- **Positional Embedding**:
  \[ X_{\text{pos}} = X + P \]
  where \(X\) is the matrix of input token features and \(P\) is the positional embedding matrix of the same shape.

Together with the experimental results, these formulas illustrate the potential of large-scale self-supervised learning for 4D vision tasks and provide a reference point for future research.
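The two formulas above can be made concrete with a minimal sketch in plain Python. This is an illustration only, not the paper's implementation: the function names `add_positional_embedding` and `masked_mse_loss`, the list-of-lists patch representation, and the toy values are all assumptions introduced here. Each patch is a flat list of numbers, and the loss is averaged over the masked patches only, matching the definition of \(N\) above.

```python
from typing import List, Sequence

Patch = Sequence[float]


def add_positional_embedding(
    tokens: Sequence[Patch], pos_emb: Sequence[Patch]
) -> List[List[float]]:
    """Compute X_pos = X + P element-wise (hypothetical helper)."""
    if len(tokens) != len(pos_emb):
        raise ValueError("tokens and pos_emb must have the same length")
    return [[x + p for x, p in zip(tok, pe)] for tok, pe in zip(tokens, pos_emb)]


def masked_mse_loss(
    targets: Sequence[Patch],
    predictions: Sequence[Patch],
    masked_indices: Sequence[int],
) -> float:
    """Average squared L2 reconstruction error over the masked patches.

    Implements L = (1/N) * sum_i ||x_i - x_hat_i||^2, where the sum runs
    over the N masked patches only (hypothetical helper).
    """
    if not masked_indices:
        raise ValueError("at least one masked patch is required")
    total = 0.0
    for i in masked_indices:
        # Squared L2 distance between the target patch and its reconstruction.
        total += sum((x - y) ** 2 for x, y in zip(targets[i], predictions[i]))
    return total / len(masked_indices)


if __name__ == "__main__":
    # Toy example: three patches with two values each; patches 1 and 2 are masked.
    targets = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    predictions = [[1.0, 2.0], [2.0, 4.0], [5.0, 8.0]]
    print(masked_mse_loss(targets, predictions, [1, 2]))  # (1 + 4) / 2 = 2.5

    tokens = [[1.0, 2.0], [3.0, 4.0]]
    pos_emb = [[0.5, 0.5], [1.0, -1.0]]
    print(add_positional_embedding(tokens, pos_emb))  # [[1.5, 2.5], [4.0, 3.0]]
```

In this sketch the unmasked patch (index 0) never contributes to the loss, which is the property that distinguishes a masked reconstruction objective from a plain auto-encoding loss over all patches.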