Advancing Human Action Recognition with Foundation Models trained on Unlabeled Public Videos

Yang Qian, Yinan Sun, Ali Kargarandehkordi, Parnian Azizian, Onur Cezmi Mutlu, Saimourya Surabhi, Pingyi Chen, Zain Jabbar, Dennis Paul Wall, Peter Washington

2024-07-16

Abstract: The increasing variety and quantity of tagged multimedia content on a variety of online platforms offer a unique opportunity to advance the field of human action recognition. In this study, we utilize 283,582 unique, unlabeled TikTok video clips, categorized into 386 hashtags, to train a domain-specific foundation model for action recognition. We employ VideoMAE V2, an advanced model integrating Masked Autoencoders (MAE) with Vision Transformers (ViT), pre-trained on this diverse collection of unstructured videos. Our model, fine-tuned on established action recognition benchmarks such as UCF101 and HMDB51, achieves state-of-the-art results: 99.05% on UCF101, 86.08% on HMDB51, 85.51% on Kinetics-400, and 74.27% on Something-Something V2 using the ViT-giant backbone. These results highlight the potential of using unstructured and unlabeled videos as a valuable source of diverse and dynamic content for training foundation models. Our investigation confirms that while initial increases in pre-training data volume significantly enhance model performance, the gains diminish as the dataset size continues to expand. Our findings emphasize two critical axioms in self-supervised learning for computer vision: (1) additional pre-training data can yield diminishing benefits for some datasets and (2) quality is more important than quantity in self-supervised learning, especially when building foundation models.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper aims to improve the performance of human action recognition. Specifically, the authors pre-train a domain-specific foundation model on a large collection of unlabeled TikTok videos to improve results on action recognition tasks. The paper argues that existing action recognition datasets, although diverse, often lack the dynamics and cultural diversity of real-world scenarios and struggle to capture some non-standard human behaviors. Using 283,582 unique, unlabeled TikTok video clips, the authors construct a dataset named TikTokActions, which is intended to reflect a broader range of real-world activities and cultural diversity and to fill gaps in existing datasets. The main contributions of the paper are as follows:

1. **Dataset**: The TikTokActions dataset contains a large number of video clips and covers a variety of unique, non-standard human behaviors that are less common in traditional action recognition datasets.
2. **Model performance**: After pre-training VideoMAE V2 on TikTokActions and fine-tuning it on the benchmark datasets UCF101, HMDB51, Kinetics-400, and Something-Something V2, the model achieves state-of-the-art results on these benchmarks.
3. **Data volume versus performance**: The paper studies how the amount of pre-training data affects downstream performance and finds that the gains shrink as the dataset grows, which suggests that in some cases data quality matters more than quantity.

Overall, by constructing and using the TikTokActions dataset, the paper shows how unlabeled social media videos can be used to train foundation models that deliver significant performance improvements on action recognition tasks.
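To make the pre-training step more concrete: masked-autoencoder video models such as VideoMAE hide most of a clip's patch tokens and train the network to reconstruct them, typically with a "tube" mask that hides the same spatial positions in every temporal slice. The following is only an illustrative sketch of that masking idea, not the authors' code. The function name `tube_mask`, the 8 x 14 x 14 token grid, and the 0.9 mask ratio are assumptions chosen for illustration; the summary above does not state the values used in the paper.

```python
import random


def tube_mask(t, h, w, mask_ratio, seed=0):
    """Build a tube mask over a t x (h*w) grid of video tokens.

    Returns a list of t rows, each a list of h*w booleans (True = masked).
    The same spatial positions are masked in every temporal slice, so a
    hidden patch cannot be recovered by copying it from a neighbouring frame.
    All arguments are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    num_spatial = h * w
    num_masked = round(mask_ratio * num_spatial)
    # Pick the spatial positions to hide once...
    masked_positions = set(rng.sample(range(num_spatial), num_masked))
    spatial = [pos in masked_positions for pos in range(num_spatial)]
    # ...and repeat that pattern across all temporal slices.
    return [list(spatial) for _ in range(t)]


# Assumed token grid: 8 temporal slices of 14 x 14 patches, 90% masked.
mask = tube_mask(t=8, h=14, w=14, mask_ratio=0.9)
visible_per_slice = sum(not m for m in mask[0])
print(len(mask), len(mask[0]), visible_per_slice)
```

With these assumed settings, only a small fraction of tokens per slice is passed to the encoder, which is what makes pre-training on hundreds of thousands of unlabeled clips computationally feasible.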
