Discriminative Spatiotemporal Alignment for Self-Supervised Video Correspondence Learning

Qiaoqiao Wei, Hui Zhang, Jun-Hai Yong
DOI: https://doi.org/10.1109/ICME55011.2023.00316
2023-01-01
Abstract: This paper focuses on self-supervised video correspondence learning, which learns effective representations from raw videos without manual annotations and exploits the learned representations for video visual tracking tasks. Previous methods extract temporal correspondence between two frames in fixed geometric structures, which easily leads to mismatches of pixels and overlooks the intra-frame semantic correspondence. To address these issues, we propose a Discriminative Spatio-temporal Alignment (DSA) framework to improve the tracking accuracy in the inference stage. DSA first discriminates representations of different instances for each reference frame through an Instance-Guided Spatial Alignment (IGSA) module. Then, it employs a Focused Temporal Alignment (FTA) module, which samples discriminative pixels from reference frames and propagates the labels of the sampled reference pixels to a target pixel. Experimental results show that DSA possesses flexibility and generalizability and has boosted previous approaches on three tracking tasks, including video object segmentation, human part segmentation, and pose keypoint tracking.
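The label propagation step described in the abstract can be illustrated with a generic affinity-based sketch of the kind commonly used at inference time in self-supervised correspondence work. This is not the authors' FTA module: the function name `propagate_labels`, the top-k selection (used here as a simple stand-in for "sampling discriminative pixels"), the cosine similarity, and the `temperature` value are all assumptions made for illustration.

```python
import numpy as np

def propagate_labels(ref_feats, ref_labels, tgt_feats, k=5, temperature=0.07):
    """Propagate labels from reference pixels to target pixels (illustrative only).

    ref_feats:  (N, C) features of reference pixels (possibly from several frames)
    ref_labels: (N, K) one-hot or soft labels of the reference pixels
    tgt_feats:  (M, C) features of target pixels
    Returns:    (M, K) soft labels for the target pixels
    """
    # Cosine similarity between every target pixel and every reference pixel.
    ref = ref_feats / (np.linalg.norm(ref_feats, axis=1, keepdims=True) + 1e-8)
    tgt = tgt_feats / (np.linalg.norm(tgt_feats, axis=1, keepdims=True) + 1e-8)
    sim = tgt @ ref.T  # (M, N)

    # Assumption: keep only the k most similar reference pixels per target pixel.
    k = min(k, sim.shape[1])
    topk_idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]  # (M, k)
    topk_sim = np.take_along_axis(sim, topk_idx, axis=1)    # (M, k)

    # Softmax over the selected reference pixels gives the propagation weights.
    logits = topk_sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Weighted sum of the selected reference labels.
    gathered = ref_labels[topk_idx]  # (M, k, K)
    return (weights[..., None] * gathered).sum(axis=1)
```

For segmentation, `ref_labels` would hold per-pixel mask labels; for keypoint tracking, the same routine could propagate per-keypoint heatmap channels. Restricting the weighted sum to a few highly similar reference pixels is one simple way to limit the pixel mismatches the abstract attributes to fixed geometric structures.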