Abstract: The emerging field of action prediction plays a vital role in computer vision applications such as autonomous driving, activity analysis, and human-computer interaction. Despite significant advances, accurately predicting future actions remains challenging because of the high dimensionality, complex dynamics, and uncertainty inherent in video data. Traditional supervised approaches require large amounts of labelled data, which is expensive and time-consuming to obtain. This paper introduces a novel self-supervised video strategy for enhancing action prediction, inspired by DINO (self-distillation with no labels). The Temporal-DINO approach employs two models: a 'student' that processes past frames, and a 'teacher' that processes both past and future frames and therefore has a broader temporal context. During training, the teacher guides the student to learn future context while the student observes only past frames. The strategy is evaluated on the ROAD dataset for the action prediction downstream task using 3D-ResNet, Transformer, and LSTM architectures. The experimental results show significant improvements in prediction performance across these architectures, with our method achieving an average enhancement of 9.9% Precision Points (PP), highlighting its effectiveness in enhancing the backbones' ability to capture long-term dependencies. Furthermore, our approach is efficient with respect to the pretraining dataset size and the number of epochs required. This method overcomes limitations of other approaches by supporting various backbone architectures, addressing multiple prediction horizons, reducing reliance on hand-crafted augmentations, and streamlining pretraining into a single stage. These findings highlight the potential of our approach for diverse video-based tasks such as activity recognition, motion planning, and scene understanding.
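
The student/teacher arrangement described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the loss and teacher update follow the original DINO recipe (cross-entropy between a centered, sharpened teacher distribution and the student distribution, plus an exponential-moving-average teacher). The function names, temperatures, and momentum value are hypothetical choices for illustration, and the backbones that would produce the logits are omitted.

```python
import math

def split_clip(frames, num_past):
    """Build the two views: the student sees only past frames,
    the teacher sees past and future frames (the whole clip)."""
    student_view = frames[:num_past]
    teacher_view = frames
    return student_view, teacher_view

def softmax(logits, temperature):
    """Numerically stable softmax with a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(logits, temperature):
    """Numerically stable log-softmax with a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - peak - log_total for x in scaled]

def distillation_loss(student_logits, teacher_logits, center,
                      student_temp=0.1, teacher_temp=0.04):
    """DINO-style cross-entropy (assumed here): the teacher output is
    centered and sharpened, and the student is trained to match it."""
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = softmax(centered, teacher_temp)
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))

def ema_update(teacher_params, student_params, momentum=0.996):
    """Assumed teacher update: exponential moving average of student weights."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

In a training loop under these assumptions, each clip would be split with `split_clip`, both views passed through their respective networks, the student updated by gradient descent on `distillation_loss`, and the teacher updated with `ema_update` rather than by gradients.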

Self-Supervised Learning of Video Representation for Anticipating Actions in Early Stage

Knowledge-guided Pre-Training and Fine-Tuning: Video Representation Learning for Action Recognition

Motion = Video - Content: Towards Unsupervised Learning of Motion Representation from Videos

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

Self-Regulated Learning for Egocentric Video Activity Anticipation

Temporally-Embedded Self-Supervised Video Representation Learning

Self-supervised pretext task collaborative multi-view contrastive learning for video action recognition

Self-Supervised Learning for Semi-Supervised Temporal Action Proposal

Action Anticipation in First-Person Videos with Self-Attention Based Multi-Modal Network

Self-supervised Temporal Discriminative Learning for Video Representation Learning

Temporal DINO: A Self-supervised Video Strategy to Enhance Action Prediction

Self-Supervised Video Representation Learning with Motion-Contrastive Perception

From Recognition to Prediction: Leveraging Sequence Reasoning for Action Anticipation

Semi-Supervised Multiple Feature Analysis for Action Recognition

Self-Supervised Representation Learning for Videos by Segmenting Via Sampling Rate Order Prediction

Self-Supervised Spatiotemporal Learning Via Video Clip Order Prediction

Learning Transferable Self-attentive Representations for Action Recognition in Untrimmed Videos with Weak Supervision

Unsupervised Learning of View-invariant Action Representations

Learning to Anticipate Egocentric Actions by Imagination

TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition