Test-Time Training on Video Streams

Renhao Wang, Yu Sun, Yossi Gandelsman, Xinlei Chen, Alexei A. Efros, Xiaolong Wang
2023-07-12
Abstract: Prior work has established test-time training (TTT) as a general framework to further improve a trained model at test time. Before making a prediction on each test instance, the model is trained on the same instance using a self-supervised task, such as image reconstruction with masked autoencoders. We extend TTT to the streaming setting, where multiple test instances (video frames in our case) arrive in temporal order. Our extension is online TTT: The current model is initialized from the previous model, then trained on the current frame and a small window of frames immediately before. Online TTT significantly outperforms the fixed-model baseline for four tasks, on three real-world datasets. The relative improvement is 45% and 66% for instance and panoptic segmentation. Surprisingly, online TTT also outperforms its offline variant that accesses more information, training on all frames from the entire test video regardless of temporal order. This differs from previous findings using synthetic videos. We conceptualize locality as the advantage of online over offline TTT. We analyze the role of locality with ablations and a theory based on bias-variance trade-off.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper explores how test-time training (TTT) can further improve a pre-trained model on video streams. Specifically, it addresses the following key points:

1. **Extending TTT to video streams**: Prior work established TTT as a general framework that improves a trained model at test time by training it on each test instance with a self-supervised task. This paper extends TTT to the streaming setting, where multiple test instances (video frames) arrive in temporal order.
2. **Online TTT method**: For each incoming frame, the model is initialized from the model used for the previous frame, then trained on the current frame and a small window of immediately preceding frames. This method is called online TTT. It significantly outperforms the fixed-model baseline on four tasks; the relative improvement is 45% for instance segmentation and 66% for panoptic segmentation.
3. **Comparing online and offline TTT**: Offline TTT trains on all frames of the entire test video, regardless of their temporal order. Surprisingly, although online TTT accesses less information, it still outperforms offline TTT. This differs from previous findings obtained on synthetic videos.
4. **Exploring the advantage of locality**: The paper proposes "locality" as the advantage of online TTT over offline TTT, and analyzes its role through ablation experiments and a theory based on the bias-variance trade-off.
5. **Experimental results**: Online TTT is evaluated on four tasks (semantic segmentation, instance segmentation, panoptic segmentation, and colorization) across three real-world datasets. On the newly collected COCO Videos dataset, the relative improvements reach 45% and 66% for instance and panoptic segmentation, respectively.
6. **New dataset COCO Videos**: To better demonstrate the importance of locality, the paper also collects a new video dataset, COCO Videos, which contains much longer and more complex video clips than other public datasets.

In summary, the main goal of this paper is to explore and improve test-time training in the context of video streams, so as to raise prediction quality on computer vision tasks.
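
The online and offline procedures summarized above can be sketched on a toy problem. This is only an illustration under strong simplifying assumptions, not the paper's implementation. The "model" is a single scalar `mu` instead of a masked autoencoder with a task head. The self-supervised loss is the squared error of reconstructing a frame's pixels with `mu`. The "prediction" is `mu` itself. All names (`ssl_grad`, `train`, `online_ttt`, `offline_ttt`) and all hyperparameter values are hypothetical.

```python
from collections import deque

def ssl_grad(mu, frame):
    """Gradient of the toy self-supervised loss mean((x - mu)^2) w.r.t. mu."""
    return 2.0 * (mu - sum(frame) / len(frame))

def train(mu, frames, steps, lr):
    """Run `steps` gradient steps on the loss averaged over `frames`."""
    for _ in range(steps):
        grad = sum(ssl_grad(mu, f) for f in frames) / len(frames)
        mu -= lr * grad
    return mu

def online_ttt(stream, mu0, window=4, steps=2, lr=0.1):
    """Online TTT: carry the model forward and train on a sliding window."""
    mu = mu0                         # start from the pre-trained model
    recent = deque(maxlen=window)    # current frame plus window - 1 predecessors
    preds = []
    for frame in stream:             # frames arrive in temporal order
        recent.append(frame)
        mu = train(mu, recent, steps, lr)  # continues from the previous frame's model
        preds.append(mu)             # predict only after training on this frame
    return preds

def offline_ttt(video, mu0, steps=200, lr=0.1):
    """Offline variant: train once on all frames, ignoring temporal order."""
    mu = train(mu0, video, steps, lr)
    return [mu for _ in video]

def mean_abs_error(preds, video):
    """Average distance between each prediction and the frame's true mean."""
    targets = [sum(f) / len(f) for f in video]
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(video)

# Toy "video": 40 frames of 3 pixels whose brightness drifts upward over time.
video = [[0.1 * t, 0.1 * t + 0.02, 0.1 * t - 0.02] for t in range(40)]
mu0 = 0.0  # stand-in for the pre-trained model (assumption)

err_fixed = mean_abs_error([mu0] * len(video), video)
err_online = mean_abs_error(online_ttt(video, mu0), video)
err_offline = mean_abs_error(offline_ttt(video, mu0), video)
print(f"fixed={err_fixed:.3f} online={err_online:.3f} offline={err_offline:.3f}")
```

In this toy stream the content drifts steadily, so the online model, which only ever fits recent frames, tracks the current frame more closely than the offline model fitted to the whole video, and both beat the fixed model. This mirrors the locality argument in spirit only and is not evidence for the paper's reported numbers.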