Advancing Human Action Recognition with Foundation Models trained on Unlabeled Public Videos

Yang Qian, Yinan Sun, Ali Kargarandehkordi, Parnian Azizian, Onur Cezmi Mutlu, Saimourya Surabhi, Pingyi Chen, Zain Jabbar, Dennis Paul Wall, Peter Washington
2024-07-16
Abstract: The increasing variety and quantity of tagged multimedia content on a variety of online platforms offer a unique opportunity to advance the field of human action recognition. In this study, we utilize 283,582 unique, unlabeled TikTok video clips, categorized into 386 hashtags, to train a domain-specific foundation model for action recognition. We employ VideoMAE V2, an advanced model integrating Masked Autoencoders (MAE) with Vision Transformers (ViT), pre-trained on this diverse collection of unstructured videos. Our model, fine-tuned on established action recognition benchmarks such as UCF101 and HMDB51, achieves state-of-the-art results: 99.05% on UCF101, 86.08% on HMDB51, 85.51% on Kinetics-400, and 74.27% on Something-Something V2 using the ViT-giant backbone. These results highlight the potential of using unstructured and unlabeled videos as a valuable source of diverse and dynamic content for training foundation models. Our investigation confirms that while initial increases in pre-training data volume significantly enhance model performance, the gains diminish as the dataset size continues to expand. Our findings emphasize two critical axioms in self-supervised learning for computer vision: (1) additional pre-training data can yield diminishing benefits for some datasets and (2) quality is more important than quantity in self-supervised learning, especially when building foundation models.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper aims to improve human action recognition. Specifically, the authors train a domain-specific foundation model on a large collection of unlabeled TikTok videos to improve performance on action recognition tasks. They argue that existing action recognition datasets, although diverse, often lack the dynamics and cultural diversity of real-world scenarios and fail to capture some non-standard human behaviors. From 283,582 unique, unlabeled TikTok video clips, the authors construct a dataset named TikTokActions, which is intended to reflect a broader range of real-world activities and cultural diversity and to fill gaps in existing datasets.

The main contributions of the paper are as follows:

1. **A new dataset**: TikTokActions contains a large number of video clips and covers a variety of unique, non-standard human behaviors that are less common in traditional action recognition datasets.
2. **Improved model performance**: After pre-training VideoMAE V2 on TikTokActions and fine-tuning it on the benchmark datasets UCF101, HMDB51, Kinetics-400, and Something-Something V2, the model achieves state-of-the-art results on these benchmarks.
3. **Relationship between data volume and performance**: The authors study how the amount of pre-training data affects downstream performance and find that the gains shrink as the data volume grows, which they take to indicate that in some cases data quality matters more than quantity.

Overall, by constructing and using the TikTokActions dataset, the paper shows how unlabeled social media videos can be used to train foundation models and achieve a significant performance improvement on action recognition tasks.