Abstract: This paper presents a new self-supervised video representation learning framework, ARVideo, which autoregressively predicts the next video token in a tailored sequence order. Two key designs are included. First, we organize autoregressive video tokens into clusters that span both spatially and temporally, thereby enabling a richer aggregation of contextual information compared to the standard spatial-only or temporal-only clusters. Second, we adopt a randomized spatiotemporal prediction order to facilitate learning from multi-dimensional data, addressing the limitations of a handcrafted spatial-first or temporal-first sequence order. Extensive experiments establish ARVideo as an effective paradigm for self-supervised video representation learning. For example, when trained with the ViT-B backbone, ARVideo competitively attains 81.2% on Kinetics-400 and 70.9% on Something-Something V2, which are on par with the strong benchmark set by VideoMAE. Importantly, ARVideo also demonstrates higher training efficiency, i.e., it trains 14% faster and requires 58% less GPU memory compared to VideoMAE.
What problem does this paper attempt to address?
This paper attempts to address the problem of applying autoregressive pretraining methods in video representation learning to enhance the multidimensional modeling capability of video data. Specifically, the paper proposes a new framework called ARVideo, which aims to learn video representations by autoregressively predicting the next element in a video sequence. ARVideo overcomes the limitations of traditional single-dimensional (spatial or temporal) clustering methods by clustering video tokens into spatiotemporal clusters and adopting a random space-time prediction order.
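As a rough formalization of this objective, the video can be viewed as $K$ spatiotemporal clusters $c_1, \dots, c_K$ that are predicted one after another in a randomly drawn order $\pi$. The notation ($c_k$, $\pi$, $\theta$) is introduced here for illustration and is not taken from the paper. The concrete loss (for example, a regression loss on pixel values) is not specified in this summary, so the generic likelihood form below is only an assumption:

$$
\mathcal{L}(\theta) = -\,\mathbb{E}_{\pi}\left[\sum_{k=1}^{K} \log p_\theta\!\left(c_{\pi(k)} \mid c_{\pi(1)}, \dots, c_{\pi(k-1)}\right)\right]
$$

A fixed spatial-first or temporal-first order corresponds to always using the same $\pi$; the randomized order replaces it with a permutation sampled over the clusters.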
### Main Contributions:
1. **Spatiotemporal Clustering**: ARVideo redefines "video elements" by clustering video tokens into spatiotemporal clusters instead of traditional single-dimensional (spatial or temporal) clustering. This allows the model to better capture contextual information and enhance the richness of semantic representations.
2. **Random Prediction Order**: Unlike traditional fixed orders (such as space-first or time-first), ARVideo adopts a random space-time prediction order, which can more effectively learn multidimensional data and capture the intrinsic multidimensional characteristics of video data.
3. **Performance and Efficiency**: Experimental results show that ARVideo is competitive on multiple benchmark datasets. For example, with a ViT-B backbone, ARVideo achieves a Top-1 accuracy of 81.2% on the Kinetics-400 dataset, which is 7% higher than the baseline method, and 70.9% on Something-Something V2, on par with VideoMAE. It also demonstrates advantages in training efficiency, training 14% faster and using 58% less GPU memory than VideoMAE.
### Background and Motivation:
- **Success in Natural Language Processing**: Autoregressive models have achieved great success in natural language processing, especially in self-supervised learning on large-scale unlabeled data. However, the application of these methods to video data is relatively rare.
- **Complexity of Video Data**: Video data has both temporal and spatial dimensions, making the direct application of autoregressive models challenging. Traditional definitions of video elements (such as pixels or patches) often fail to capture rich semantic information.
- **Limitations of Existing Methods**: Existing self-supervised video representation learning methods mainly rely on masked modeling (such as VideoMAE), while autoregressive modeling has not been fully explored.
### Method Overview:
1. **Video Tokenization**: The video is divided into non-overlapping cubes and converted into video tokens through a linear projection layer, reducing computational demand and enhancing semantic representation.
2. **Spatiotemporal Clustering**: Adjacent video tokens are clustered into spatiotemporal clusters, where tokens within each cluster can fully attend to each other, better integrating semantic content.
3. **Random Prediction Order**: A random space-time prediction order is adopted to avoid the limitations of fixed orders, allowing the model to adapt to both long-range and short-range spatiotemporal information.
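The three steps above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`tokenize`, `cluster_ids`, `random_order_mask`), the cube size `(2, 16, 16)`, the cluster size `(2, 2, 2)`, the embedding width of 32, and the block-causal attention mask are all hypothetical choices made here. A random matrix stands in for the learned linear projection layer.

```python
import numpy as np

def tokenize(video, cube=(2, 16, 16)):
    """Split a (T, H, W, C) video into non-overlapping cubes, one flat vector per cube."""
    T, H, W, C = video.shape
    ct, ch, cw = cube
    gt, gh, gw = T // ct, H // ch, W // cw        # token grid size (assumes exact divisibility)
    x = video.reshape(gt, ct, gh, ch, gw, cw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)          # -> (gt, gh, gw, ct, ch, cw, C)
    return x.reshape(gt * gh * gw, ct * ch * cw * C), (gt, gh, gw)

def cluster_ids(grid, cluster=(2, 2, 2)):
    """Assign each token to a spatiotemporal cluster of neighbouring tokens."""
    gt, gh, gw = grid
    kt, kh, kw = cluster
    t, h, w = np.meshgrid(np.arange(gt), np.arange(gh), np.arange(gw), indexing="ij")
    n_h, n_w = gh // kh, gw // kw                 # clusters per spatial axis
    ids = (t // kt) * (n_h * n_w) + (h // kh) * n_w + (w // kw)
    return ids.reshape(-1)                        # same (t, h, w) order as the tokens

def random_order_mask(ids, rng):
    """Draw a random cluster order and build a block-causal visibility mask."""
    n_clusters = int(ids.max()) + 1
    order = rng.permutation(n_clusters)           # order[k] = cluster handled at step k
    step = np.empty(n_clusters, dtype=int)
    step[order] = np.arange(n_clusters)           # step[c] = position of cluster c
    s = step[ids]                                 # prediction step of every token
    # Token i may see token j if j's cluster is at the same step (full attention
    # inside a cluster) or an earlier step (autoregressive across clusters).
    mask = s[:, None] >= s[None, :]
    return order, mask

rng = np.random.default_rng(0)
video = rng.standard_normal((8, 64, 64, 3))       # toy clip: 8 frames of 64x64 RGB
cubes, grid = tokenize(video)                     # 64 cubes on a 4x4x4 grid
proj = rng.standard_normal((cubes.shape[1], 32)) * 0.02  # stand-in for the learned projection
tokens = cubes @ proj                             # (64, 32) video tokens
ids = cluster_ids(grid)                           # 8 clusters of 8 tokens each
order, mask = random_order_mask(ids, rng)
print(tokens.shape, grid, order)
```

The mask only expresses which tokens are visible to which; how the targets are shifted and which loss is applied are left out because the summary does not describe them. Setting `cluster=(1, 2, 2)` or `cluster=(2, 1, 1)` in `cluster_ids` would give the spatial-only and temporal-only clusters that the spatiotemporal design is contrasted with.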
### Experimental Results:
- **Performance Comparison**: With a ViT-B backbone, ARVideo attains 81.2% on Kinetics-400 and 70.9% on Something-Something V2, surpassing existing autoregressive methods and performing on par with masked modeling methods such as VideoMAE.
- **Transfer Learning**: ARVideo also outperforms other methods in transfer learning on the AVA v2.2 and HMDB datasets, demonstrating strong feature transferability.
- **Computational Cost**: ARVideo shows advantages in training time and GPU memory consumption, resulting in lower training costs.
### Conclusion:
As a new autoregressive pretraining method, ARVideo addresses the multidimensional modeling problem of video data through spatiotemporal clustering and a randomized prediction order. It achieves results on par with VideoMAE on Kinetics-400 and Something-Something V2 while training 14% faster and using 58% less GPU memory, providing a new path for self-supervised video representation learning.