Abstract: Semi-supervised learning for video action recognition is a challenging research area. Existing state-of-the-art methods apply data augmentation along the temporal dimension of actions and combine it with FixMatch, the mainstream consistency-based semi-supervised learning framework, for action recognition. However, these approaches have the following limitations: (1) Data augmentation based on video clips lacks coarse-grained and fine-grained representations of actions in temporal sequences, so the models have difficulty understanding synonymous representations of actions in different motion phases. (2) Pseudo-label selection based on constant thresholds lacks a "make-up curriculum" for difficult actions, which results in low utilization of the unlabeled data corresponding to difficult actions. To address these shortcomings, we propose TACL, a semi-supervised action recognition algorithm based on temporal augmentation using curriculum learning. Compared to previous works, TACL explores different representations of the same action semantics in the temporal sequences of a video and uses the idea of curriculum learning (CL) to reduce the difficulty of the model training process. First, for different action expressions with the same semantics, we design temporal action augmentation (TAA) for videos to obtain coarse-grained and fine-grained action expressions based on constant-velocity and hetero-velocity methods, respectively. Second, we construct a temporal signal that constrains the model so that fine-grained action expressions containing different movement phases yield the same prediction, and we achieve action consistency learning (ACL) by combining this signal with the label and pseudo-label signals. Finally, we propose action curriculum pseudo labeling (ACPL), a dynamic threshold evaluation algorithm that applies loose and strict thresholds in parallel to select and label unlabeled data. We evaluate TACL on three standard public datasets: UCF101, HMDB51, and Kinetics. The experiments show that TACL significantly improves the accuracy of models trained on a small amount of labeled data and better evaluates the learning effects for different actions.
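
To make two of the ideas above concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that "constant-velocity" sampling means evenly spaced frame indices and that "hetero-velocity" sampling means irregularly spaced frame indices; the abstract does not specify either. The class-wise threshold rule is likewise a generic dynamic-threshold scheme chosen for illustration, not ACPL itself. All function names and the `base_threshold` parameter are hypothetical.

```python
import numpy as np


def constant_velocity_indices(num_frames, clip_len):
    """Assumed coarse-grained view: evenly spaced frame indices over the video."""
    return np.linspace(0, num_frames - 1, clip_len).round().astype(int)


def hetero_velocity_indices(num_frames, clip_len, rng):
    """Assumed fine-grained view: sorted random indices, so frame gaps vary."""
    # Sample with replacement only when the video is shorter than the clip.
    replace = clip_len > num_frames
    return np.sort(rng.choice(num_frames, size=clip_len, replace=replace))


def curriculum_thresholds(probs, base_threshold=0.95):
    """Hypothetical per-class thresholds scaled by an estimated learning status.

    probs: array of shape (num_samples, num_classes) with softmax outputs
    on unlabeled data.
    """
    num_classes = probs.shape[1]
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    # Proxy for learning status: number of confident predictions per class.
    learned = np.bincount(pred[conf >= base_threshold], minlength=num_classes)
    status = learned / max(learned.max(), 1)
    # Classes that are learned less well get a lower (easier) threshold.
    return base_threshold * status / (2.0 - status)


def select_pseudo_labels(probs, thresholds):
    """Keep a sample if its confidence reaches its predicted class's threshold."""
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    return pred, conf >= thresholds[pred]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(constant_velocity_indices(10, 4))
    print(hetero_velocity_indices(10, 4, rng))
    probs = np.array([[0.97, 0.02, 0.01],
                      [0.10, 0.60, 0.30]])
    labels, mask = select_pseudo_labels(probs, curriculum_thresholds(probs))
    print(labels, mask)
```

In this sketch a class with no confident predictions receives a threshold of zero, so every sample predicted as that class is accepted. A practical system would likely add a warm-up phase or a lower bound on the threshold; that choice is left open here.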

Leveraging Frame- and Feature-Level Progressive Augmentation for Semi-supervised Action Recognition

FeatMatch: Feature-Based Augmentation for Semi-Supervised Learning

Neighbor-Guided Consistent and Contrastive Learning for Semi-Supervised Action Recognition

An Animation-based Augmentation Approach for Action Recognition from Discontinuous Video

PatchMix Augmentation to Identify Causal Features in Few-shot Learning

Semi-Supervised Action Recognition From Temporal Augmentation Using Curriculum Learning

Adaptive semi-supervised learning from stronger augmentation transformations of discrete text information

Attack-Augmentation Mixing-Contrastive Skeletal Representation Learning

Semi-supervised human action recognition via dual-stream cross-fusion and class-aware memory bank

Improving Self-Supervised Action Recognition from Extremely Augmented Skeleton Sequences

Adversarial Augmentation Training Makes Action Recognition Models More Robust to Realistic Video Distribution Shifts

B2C-AFM: Bi-Directional Co-Temporal and Cross-Spatial Attention Fusion Model for Human Action Recognition

Learnable Feature Augmentation Framework for Temporal Action Localization

Learn2Augment: Learning to Composite Videos for Data Augmentation in Action Recognition

DTCM: Joint Optimization of Dark Enhancement and Action Recognition in Videos

Contrastive Learning from Extremely Augmented Skeleton Sequences for Self-Supervised Action Recognition

MVP-Shot: Multi-Velocity Progressive-Alignment Framework for Few-Shot Action Recognition

Towards Weakly Supervised End-to-end Learning for Long-video Action Recognition

Revisiting Weak-to-Strong Consistency in Semi-Supervised Semantic Segmentation

Learning Frame-Level Affinity with Video-Level Labels for Weakly Supervised Temporal Action Detection

Progressive Instance-Aware Feature Learning for Compositional Action Recognition