Abstract: Prior work has established test-time training (TTT) as a general framework to further improve a trained model at test time. Before making a prediction on each test instance, the model is trained on the same instance using a self-supervised task, such as image reconstruction with masked autoencoders. We extend TTT to the streaming setting, where multiple test instances - video frames in our case - arrive in temporal order. Our extension is online TTT: The current model is initialized from the previous model, then trained on the current frame and a small window of frames immediately before. Online TTT significantly outperforms the fixed-model baseline for four tasks, on three real-world datasets. The relative improvement is 45% and 66% for instance and panoptic segmentation. Surprisingly, online TTT also outperforms its offline variant that accesses more information, training on all frames from the entire test video regardless of temporal order. This differs from previous findings using synthetic videos. We conceptualize locality as the advantage of online over offline TTT. We analyze the role of locality with ablations and a theory based on bias-variance trade-off.

What problem does this paper attempt to address?

The paper explores how Test-Time Training (TTT) can further improve a pre-trained model when the test data is a video stream. Specifically, it addresses the following points:

1. **Extending TTT to video stream settings**: Previous work established TTT as a general framework for improving a trained model at test time by training it on a self-supervised task. This paper extends TTT to the streaming setting, where multiple test instances (video frames) arrive in temporal order.
2. **Online TTT method**: For each frame in the stream, the model is initialized from the model used for the previous frame, then trained on the current frame and a small window of frames immediately before it. This method is called Online TTT. It significantly outperforms the fixed-model baseline on four tasks, with relative improvements of 45% and 66% for instance segmentation and panoptic segmentation, respectively.
3. **Comparing online and offline TTT**: Offline TTT trains on all frames of the entire test video, regardless of their temporal order. Surprisingly, although Online TTT accesses less information, it still outperforms Offline TTT. This finding differs from previous results obtained using synthetic videos.
4. **Exploring the advantage of locality**: The paper introduces "locality" as the advantage of Online TTT over Offline TTT and analyzes its role through ablation experiments and a theory based on the bias-variance trade-off.
5. **Experimental results**: Online TTT outperforms the fixed-model baseline on four tasks (semantic segmentation, instance segmentation, panoptic segmentation, and colorization) across three real-world datasets. On the newly constructed COCO Videos dataset, the relative improvements reach 45% for instance segmentation and 66% for panoptic segmentation.
6. **New dataset COCO Videos**: To better demonstrate the importance of locality, the paper also collects a new video dataset, COCO Videos, which contains much longer and more complex video clips than other public datasets.

In summary, the main goal of this paper is to explore and optimize test-time training in the context of video streams in order to improve prediction quality on computer vision tasks.
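The online TTT procedure summarized above (carry the model over from the previous frame, train on a small window ending at the current frame, then predict) can be illustrated with a toy sketch. This is not the paper's implementation: the "model" here is a single scalar, the self-supervised task is a hypothetical stand-in (reconstructing every pixel of a frame as that scalar), and the names `online_ttt`, `self_supervised_grad`, `window_size`, `lr`, and `steps` are assumptions made for illustration only.

```python
from collections import deque
from typing import Deque, Iterable, List, Sequence

Frame = Sequence[float]  # toy stand-in for a video frame: a flat list of pixel values


def self_supervised_grad(param: float, frame: Frame) -> float:
    """Gradient of the mean squared reconstruction error for one frame.

    Assumption: the toy model reconstructs every pixel as `param`, so the
    loss is mean((param - x)^2) over the pixels x of the frame.
    """
    return sum(2.0 * (param - x) for x in frame) / len(frame)


def online_ttt(
    frames: Iterable[Frame],
    init_param: float,
    window_size: int = 2,
    lr: float = 0.1,
    steps: int = 1,
) -> List[float]:
    """Toy online TTT loop over a stream of frames.

    The parameter is never reset between frames, so each frame starts from
    the model adapted on the previous one. Training uses a sliding window
    holding the current frame and up to `window_size - 1` preceding frames.
    """
    param = init_param
    window: Deque[Frame] = deque(maxlen=window_size)  # oldest frame drops out automatically
    predictions: List[float] = []
    for frame in frames:
        window.append(frame)
        for _ in range(steps):
            # Average the self-supervised gradient over the frames in the window.
            grad = sum(self_supervised_grad(param, f) for f in window) / len(window)
            param -= lr * grad
        # Toy "prediction": the adapted parameter itself (an estimate of frame brightness).
        predictions.append(param)
    return predictions


def fixed_model_baseline(frames: Iterable[Frame], init_param: float) -> List[float]:
    """Baseline with no test-time training: the same parameter for every frame."""
    return [init_param for _ in frames]


if __name__ == "__main__":
    stream = [[0.0, 0.0], [1.0, 1.0], [1.0, 1.0]]  # brightness shifts after the first frame
    print(online_ttt(stream, init_param=0.0, window_size=2, lr=0.5))
    print(fixed_model_baseline(stream, init_param=0.0))
```

In this sketch, setting `window_size` to the length of the whole video and training before any prediction would correspond loosely to the offline variant; a small window keeps training local to the current frame, which is the property the paper calls locality.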

Test-Time Training on Video Streams

Learning to Adapt to Online Streams with Distribution Shifts

Exploring Motion Cues for Video Test-Time Adaptation

ReC-TTT: Contrastive Feature Reconstruction for Test-Time Training

Revisiting Realistic Test-Time Training: Sequential Inference and Adaptation by Anchored Clustering Regularized Self-Training

Test Time Learning for Time Series Forecasting

CustomTTT: Motion and Appearance Customized Video Generation via Test-Time Training

NC-TTT: A Noise Contrastive Approach for Test-Time Training

Video Test-Time Adaptation for Action Recognition

$R^2$-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding

Robust Test-Time Adaptation in Dynamic Scenarios

TTVD: Towards a Geometric Framework for Test-Time Adaptation Based on Voronoi Diagram

Learning from One Continuous Video Stream

Bag of Tricks for Fully Test-Time Adaptation

ODTrack: Online Dense Temporal Token Learning for Visual Tracking

Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations

TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding

TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale

Real-time Online Video Detection with Temporal Smoothing Transformers

TTT4Rec: A Test-Time Training Approach for Rapid Adaption in Sequential Recommendation