Abstract: Spatio-Temporal Video Grounding (STVG) aims to localize the spatio-temporal tube of a specific object in an untrimmed video given a free-form natural language query. Because annotating tubes is labor-intensive, recent works have explored weakly supervised approaches, which usually result in significant performance degradation. To achieve a less expensive STVG method with acceptable accuracy, this work investigates the "single-frame supervision" paradigm, in which the supervisory signal is a single frame labeled with a bounding box, chosen within the temporal boundary that the fully supervised counterpart would annotate. Based on the characteristics of the STVG problem, we propose a Two-Stage Multiple Instance Learning (T-SMILE) method, which creates pseudo labels by expanding the annotated frame to its contextual frames, thereby establishing a fully supervised problem to facilitate further model training. The innovations of the proposed method are threefold: 1) utilizing multiple instance learning to dynamically select instances in positive bags for the recognition of starting and ending timestamps, 2) learning highly discriminative query features by incorporating spatial prior constraints in cross-attention, and 3) designing a curriculum learning-based strategy that iteratively assigns dynamic weights to the spatial and temporal branches, thereby gradually adapting to the branch that is more difficult to learn. To facilitate future research on this task, we also contribute a large-scale benchmark containing 12,469 videos of complex scenes with single-frame annotation. Extensive experiments on two benchmarks demonstrate that T-SMILE significantly outperforms all weakly supervised methods. Remarkably, it also performs better than some fully supervised methods that require much higher annotation labor costs. The dataset and code are available at https://github.com/qumengxue/T-SMILE.
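As a rough illustration of the third point, the following minimal Python sketch shows one way dynamic weights could shift toward the harder of two branches as training progresses. It is not the T-SMILE implementation: the function name `branch_weights`, the `progress` and `temperature` parameters, and the softmax-over-losses rule are all assumptions made for this example, since the abstract does not specify the weighting formula.

```python
import math


def branch_weights(spatial_loss, temporal_loss, progress, temperature=1.0):
    """Return (w_spatial, w_temporal), two weights that sum to 1.

    Hypothetical rule, not taken from the paper:
    - progress = 0.0 gives equal weights (early training).
    - progress = 1.0 gives a softmax over the two losses, so the
      branch with the larger loss (the harder one) gets more weight.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")

    # Subtract the max before exponentiating for numerical stability.
    m = max(spatial_loss, temporal_loss)
    e_s = math.exp((spatial_loss - m) / temperature)
    e_t = math.exp((temporal_loss - m) / temperature)
    soft_s = e_s / (e_s + e_t)

    # Interpolate between uniform weighting and the softmax weighting.
    w_s = (1.0 - progress) * 0.5 + progress * soft_s
    return w_s, 1.0 - w_s


def combined_loss(spatial_loss, temporal_loss, progress):
    """Weighted sum of the two branch losses under the rule above."""
    w_s, w_t = branch_weights(spatial_loss, temporal_loss, progress)
    return w_s * spatial_loss + w_t * temporal_loss
```

In this sketch, `progress` would typically be the current epoch divided by the total number of epochs, so the weighting starts uniform and only gradually emphasizes the branch with the larger loss.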

Surveillance Video Parsing with Single Frame Supervision

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

Spatio-Temporal Video Segmentation of Static Scenes and Its Applications

Video Scene Graph Generation from Single-Frame Weak Supervision

Vanishing-Point-Guided Video Semantic Segmentation of Driving Scenes

Real-time spatiotemporal segmentation of video objects in the H.264 compressed domain

Single-Frame Supervision for Spatio-Temporal Video Grounding

A Representative-Based Framework For Parsing And Summarizing Events In Surveillance Videos

Temporal Pixel-Level Semantic Understanding Through the VSPW Dataset

Weakly Supervised Video Salient Object Detection via Point Supervision

Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-wise Pseudo Labeling

Video Frame Prediction from a Single Image and Events

SpVOS: Efficient Video Object Segmentation With Triple Sparse Convolution

VSPW: A Large-scale Dataset for Video Scene Parsing in the Wild

Unsupervised video forecasting with flow parsing mechanism of human visual system

Video Scene Parsing: an Overview of Deep Learning Methods and Datasets

Intelligent Analysis Oriented Surveillance Video Coding

Making Every Frame Matter: Continuous Video Understanding for Large Models via Adaptive State Modeling

Fine-Grained Human-Centric Tracklet Segmentation with Single Frame Supervision

Mask Propagation for Efficient Video Semantic Segmentation