Abstract: The emerging field of action prediction plays a vital role in computer vision applications such as autonomous driving, activity analysis, and human-computer interaction. Despite significant advances, accurately predicting future actions remains challenging because of the high dimensionality, complex dynamics, and uncertainty inherent in video data. Traditional supervised approaches require large amounts of labelled data, which is expensive and time-consuming to obtain. This paper introduces a novel self-supervised video strategy for enhancing action prediction, inspired by DINO (self-distillation with no labels). The Temporal-DINO approach employs two models: a 'student' that processes past frames only, and a 'teacher' that processes both past and future frames and therefore sees a broader temporal context. During training, the teacher guides the student to learn future context while observing only past frames. The strategy is evaluated on the ROAD dataset for the downstream task of action prediction using 3D-ResNet, Transformer, and LSTM architectures. The experimental results show significant improvements in prediction performance across these architectures, with our method achieving an average enhancement of 9.9% Precision Points (PP), highlighting its effectiveness in improving the backbones' ability to capture long-term dependencies. Furthermore, our approach is efficient with respect to the pretraining dataset size and the number of epochs required. The method overcomes limitations of other approaches: it considers various backbone architectures, addresses multiple prediction horizons, reduces reliance on hand-crafted augmentations, and streamlines pretraining into a single stage. These findings highlight the potential of our approach in diverse video-based tasks such as activity recognition, motion planning, and scene understanding.
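The student-teacher setup described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the loss or the teacher update rule, so the code below borrows them from the original DINO recipe as an assumption: a cross-entropy between centred, temperature-sharpened teacher outputs and student outputs, plus an exponential moving average (EMA) update of the teacher. The function names (`split_clip`, `distillation_loss`, `ema_update`), the temperatures, and the momentum value are hypothetical choices for illustration, not the authors' implementation.

```python
import math

def split_clip(frames, num_past):
    """Return (student_input, teacher_input) for one clip.

    The student sees only the first `num_past` frames; the teacher sees
    the whole clip, i.e. past and future frames.
    """
    return frames[:num_past], frames

def softmax(logits, temperature):
    # Numerically stable softmax over temperature-scaled logits.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(logits, temperature):
    # Log-space version avoids log(0) when a probability underflows.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_total for x in scaled]

def distillation_loss(student_logits, teacher_logits, center,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy H(teacher, student), DINO-style (assumed, not confirmed).

    The teacher output is centred and sharpened with a low temperature;
    in real training no gradient would flow through the teacher.
    """
    centred = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = softmax(centred, teacher_temp)
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))

def ema_update(teacher_params, student_params, momentum=0.996):
    # Assumed update rule: the teacher tracks a moving average of the student.
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

# Toy usage: an 8-frame clip with 4 past frames.
past_frames, all_frames = split_clip(list(range(8)), num_past=4)
matched = distillation_loss([5.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 0.0, 0.0])
mismatched = distillation_loss([0.0, 5.0, 0.0], [5.0, 0.0, 0.0], [0.0, 0.0, 0.0])
```

In this toy setup the loss is small when the student, given only past frames, produces the same distribution the teacher derives from the full clip, and large when the two disagree. Minimising it is what pushes the student to encode future context.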

Temporal Context Consistency Above All: Enhancing Long-Term Anticipation by Learning and Enforcing Temporal Constraints

From Recognition to Prediction: Leveraging Sequence Reasoning for Action Anticipation

ContextDet: Temporal Action Detection with Adaptive Context Aggregation

LSTC: Boosting Atomic Action Detection with Long-Short-Term Context

Temporal Distinct Representation Learning for Action Recognition

Temporal Aggregate Representations for Long-Range Video Understanding

Leveraging Temporal Contextualization for Video Action Recognition

Introducing Gating and Context into Temporal Action Detection

Temporal DINO: A Self-supervised Video Strategy to Enhance Action Prediction

Alignment-guided Temporal Attention for Video Action Recognition

Temporal Segment Transformer for Action Segmentation

How Much Temporal Long-Term Context is Needed for Action Segmentation?

AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?

TTPP: Temporal Transformer with Progressive Prediction for efficient action anticipation

Play and rewind: Context-aware video temporal action proposals

Self-attention-based Long Temporal Sequence Modeling Method for Temporal Action Detection

Efficient Action Detection in Untrimmed Videos via Multi-Task Learning

LoSA: Long-Short-range Adapter for Scaling End-to-End Temporal Action Localization

Spatial–Temporal Context-Aware Online Action Detection and Prediction

An Empirical Study of End-to-End Temporal Action Detection

Intention-Conditioned Long-Term Human Egocentric Action Forecasting