Abstract: The emerging field of action prediction plays a vital role in computer vision applications such as autonomous driving, activity analysis, and human-computer interaction. Despite significant advancements, accurately predicting future actions remains challenging because of the high dimensionality, complex dynamics, and uncertainties inherent in video data. Traditional supervised approaches require large amounts of labelled data, which is expensive and time-consuming to obtain. This paper introduces a novel self-supervised video strategy for enhancing action prediction, inspired by DINO (self-distillation with no labels). The Temporal-DINO approach employs two models: a 'student' that processes past frames and a 'teacher' that processes both past and future frames, giving it a broader temporal context. During training, the teacher guides the student to learn future context while observing only past frames. The strategy is evaluated on the ROAD dataset for the action prediction downstream task using 3D-ResNet, Transformer, and LSTM architectures. The experimental results show significant improvements in prediction performance across these architectures, with our method achieving an average enhancement of 9.9% Precision Points (PP), which highlights its effectiveness in improving the backbones' ability to capture long-term dependencies. Furthermore, our approach is efficient with respect to the pretraining dataset size and the number of epochs required. This method addresses limitations of other approaches by supporting various backbone architectures, handling multiple prediction horizons, reducing reliance on hand-crafted augmentations, and streamlining pretraining into a single stage. These findings highlight the potential of our approach in diverse video-based tasks such as activity recognition, motion planning, and scene understanding.
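The student/teacher arrangement described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the abstract does not specify the loss, the temperatures, or how the teacher is updated, so the centering step, the temperature values (0.1 and 0.04), and the momentum update (0.996) are assumptions borrowed from the original DINO recipe. The function names (`split_clip`, `distillation_loss`, `ema_update`) are hypothetical.

```python
import math


def split_clip(frames, num_past):
    """Build the two views: the student sees only the past frames,
    while the teacher sees the past and the future frames."""
    student_view = frames[:num_past]
    teacher_view = frames
    return student_view, teacher_view


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_total for x in scaled]


def distillation_loss(student_logits, teacher_logits, center,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher and student output distributions.
    Centering and both temperatures are assumed from DINO, not taken
    from the abstract."""
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = [math.exp(v) for v in log_softmax(centered, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))


def ema_update(teacher_params, student_params, momentum=0.996):
    """Assumed DINO-style update: the teacher tracks an exponential
    moving average of the student parameters."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

In a training loop built on this sketch, each backbone would embed its own view, the loss would be backpropagated through the student only, and the teacher would be refreshed with `ema_update` after every step.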

DINO-Foresight: Looking into the Future with DINO

Temporal DINO: A Self-supervised Video Strategy to Enhance Action Prediction

Mask DINO: Towards a Unified Transformer-Based Framework for Object Detection and Segmentation

Upsampling DINOv2 features for unsupervised vision tasks and weakly supervised materials segmentation

DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding

DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment

DenseDINO: Boosting Dense Self-Supervised Learning with Token-Based Point-Level Consistency

DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning

Change Dino: A Unified Transformer-Based Framework for Object-Level Change Detection and Segmentation in Remote Sensing Imagery

Swiss DINO: Efficient and Versatile Vision Framework for On-device Personal Object Search

Vision Transformers for Dense Prediction

Single Level Feature-to-Feature Forecasting with Deformable Convolutions

UIFormer: A Unified Transformer-based Framework for Incremental Few-Shot Object Detection and Instance Segmentation

Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation

FipTR: A Simple yet Effective Transformer Framework for Future Instance Prediction in Autonomous Driving

Deep ViT Features as Dense Visual Descriptors

DINOv2: Learning Robust Visual Features without Supervision

Temporal Fusion Transformers for interpretable multi-horizon time series forecasting

FutureDepth: Learning to Predict the Future Improves Video Depth Estimation

From Saliency to DINO: Saliency-guided Vision Transformer for Few-shot Keypoint Detection

DIVE: Taming DINO for Subject-Driven Video Editing