Abstract: Semi-supervised learning for video action recognition is a challenging research area. Existing state-of-the-art methods apply data augmentation along the temporal dimension of actions and combine it with FixMatch, the mainstream consistency-based semi-supervised learning framework, for action recognition. However, these approaches have the following limitations: (1) Data augmentation based on video clips lacks coarse-grained and fine-grained representations of actions in temporal sequences, so models have difficulty understanding synonymous representations of actions in different motion phases. (2) Pseudo-label selection based on constant thresholds lacks a "make-up curriculum" for difficult actions, which results in low utilization of the unlabeled data corresponding to those actions. To address these shortcomings, we propose TACL, a semi-supervised action recognition algorithm based on temporal augmentation using curriculum learning. Compared to previous works, TACL explores different representations of the same action semantics in video temporal sequences and uses the idea of curriculum learning (CL) to reduce the difficulty of the model training process. First, for different action expressions with the same semantics, we design temporal action augmentation (TAA) for videos, which obtains coarse-grained and fine-grained action expressions using constant-velocity and hetero-velocity methods, respectively. Second, we construct a temporal signal that constrains the model so that fine-grained action expressions containing different movement phases yield the same prediction, and we achieve action consistency learning (ACL) by combining this signal with the label and pseudo-label signals. Finally, we propose action curriculum pseudo labeling (ACPL), a parallel loose-and-strict dynamic threshold evaluation algorithm for selecting and labeling unlabeled data. We evaluate TACL on three standard public datasets: UCF101, HMDB51, and Kinetics.
The combined experiments show that TACL significantly improves the accuracy of models trained on a small amount of labeled data and better evaluates the learning effects for different actions.
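
To illustrate the contrast between a constant threshold and a curriculum-style dynamic threshold for pseudo-label selection, the following is a minimal sketch and not the authors' ACPL algorithm, whose details are not given in the abstract. It assumes a simple per-class heuristic: classes that have rarely been predicted with confidence so far get a lower threshold. The function names, the `base_tau` parameter, and the mapping `x / (2 - x)` are all illustrative assumptions.

```python
from collections import Counter


def class_thresholds(confident_preds, num_classes, base_tau=0.95):
    """Return one confidence threshold per class.

    confident_preds: class indices that were confidently predicted so far
    (hypothetical bookkeeping; an assumption of this sketch).
    """
    counts = Counter(confident_preds)
    max_count = max(counts.values(), default=0)
    thresholds = []
    for c in range(num_classes):
        # Learning status in [0, 1]: 1.0 for the best-learned class.
        status = counts.get(c, 0) / max_count if max_count else 0.0
        # Convex mapping (an assumed choice): harder classes get lower thresholds.
        thresholds.append(base_tau * status / (2.0 - status))
    return thresholds


def select_pseudo_labels(probs, thresholds):
    """Keep (sample_index, label) pairs whose top probability meets
    the threshold of the predicted class."""
    selected = []
    for i, p in enumerate(probs):
        label = max(range(len(p)), key=p.__getitem__)
        if p[label] >= thresholds[label]:
            selected.append((i, label))
    return selected
```

With a constant threshold of 0.95, a sample predicted at 0.6 confidence for a rarely learned class would be discarded; under this sketch it can be kept because that class's threshold is lowered. A class that has never been confidently predicted gets a threshold of 0 here, so a practical variant would likely add a lower bound.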

CTDA: Contrastive Temporal Domain Adaptation for Action Segmentation

Spatio-temporal Contrastive Domain Adaptation for Action Recognition

SMC-NCA: Semantic-guided Multi-level Contrast for Semi-supervised Temporal Action Segmentation

Temporal Segment Transformer for Action Segmentation

Temporal Action Segmentation with High-level Complex Activity Labels

C2F-TCN: A Framework for Semi and Fully Supervised Temporal Action Segmentation

ContextDet: Temporal Action Detection with Adaptive Context Aggregation

TODO-Net: Temporally Observed Domain Contrastive Network for 3-D Early Action Prediction

Learning Discriminative Spatio-temporal Representations for Semi-supervised Action Recognition

Involving Distinguished Temporal Graph Convolutional Networks for Skeleton-Based Temporal Action Segmentation

Contrastive Learning and Self-Training for Unsupervised Domain Adaptation in Semantic Segmentation

Temporal Attentive Alignment for Large-Scale Video Domain Adaptation

TAMT: Temporal-Aware Model Tuning for Cross-Domain Few-Shot Action Recognition

Semi-Supervised Action Recognition From Temporal Augmentation Using Curriculum Learning

DIR-AS: Decoupling Individual Identification and Temporal Reasoning for Action Segmentation

Channel-Temporal Attention for First-Person Video Domain Adaptation

Non-Local Temporal Difference Network for Temporal Action Detection

Temporal Action Localization with Enhanced Instant Discriminability

Pay Attention to Target: Relation-Aware Temporal Consistency for Domain Adaptive Video Semantic Segmentation

Multi-Modal Domain Adaptation Across Video Scenes for Temporal Video Grounding

DA-STC: Domain Adaptive Video Semantic Segmentation via Spatio-Temporal Consistency