Abstract: Self-supervised methods have shown remarkable progress in learning high-level semantics and low-level temporal correspondence. Building on these results, we take one step further and explore the possibility of integrating these two features to enhance object-centric representations. Our preliminary experiments indicate that query slot attention can extract different semantic components from the RGB feature map, while random-sampling-based slot attention can exploit temporal correspondence cues between frames to assist instance identification. Motivated by this, we propose a novel semantic-aware masked slot attention on top of the fused semantic features and correspondence maps. It comprises two slot attention stages that share a set of learnable Gaussian distributions. In the first stage, we use the mean vectors as slot initialization to decompose potential semantics and generate semantic segmentation masks through iterative attention. In the second stage, for each semantic component, we randomly sample slots from the corresponding Gaussian distribution and perform masked feature aggregation within the semantic area to exploit temporal correspondence patterns for instance identification. We adopt semantic- and instance-level temporal consistency as self-supervision to encourage temporally coherent object-centric representations. Our model effectively identifies multiple object instances with semantic structure, reaching promising results on unsupervised video object discovery. Furthermore, we achieve state-of-the-art performance on dense label propagation tasks, demonstrating the potential for object-centric analysis. The code is released at https://github.com/shvdiwnkozbw/SMTC.
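The two-stage procedure described in the abstract can be illustrated with a minimal NumPy sketch. This is not the released implementation. It assumes a simplified slot attention without the learned projections, GRU, or MLP that such modules normally use, treats the fused features as a generic `(N, D)` array, and omits the temporal consistency losses. The names `slot_attention`, `two_stage_decompose`, `log_sigma`, and `num_inst` are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(slots, feats, mask=None, iters=3, eps=1e-8):
    """Simplified slot attention (assumption: no learned projections or GRU).

    slots: (K, D), feats: (N, D), mask: optional (K, N) with 1 where a slot
    is allowed to attend. Returns updated slots and the last attention map.
    """
    d = feats.shape[1]
    for _ in range(iters):
        logits = slots @ feats.T / np.sqrt(d)            # (K, N)
        if mask is not None:
            # Masked slots are excluded from the competition at that location.
            logits = np.where(mask > 0, logits, -1e9)
        attn = softmax(logits, axis=0)                   # slots compete per location
        weights = attn / (attn.sum(axis=1, keepdims=True) + eps)
        slots = weights @ feats                          # weighted mean of features
    return slots, attn

def two_stage_decompose(feats, mu, log_sigma, num_inst=2, rng=None):
    """Stage 1: semantic masks from mean-initialized slots.
    Stage 2: instance slots sampled per semantic Gaussian, masked aggregation."""
    rng = np.random.default_rng(0) if rng is None else rng
    K, D = mu.shape

    # Stage 1: the Gaussian means act as query slots.
    _, sem_attn = slot_attention(mu, feats)
    sem_label = sem_attn.argmax(axis=0)                  # (N,)
    sem_mask = (np.arange(K)[:, None] == sem_label[None, :]).astype(float)

    # Stage 2: sample num_inst slots from each semantic Gaussian.
    noise = rng.standard_normal((K, num_inst, D))
    inst_slots = mu[:, None, :] + np.exp(log_sigma)[:, None, :] * noise
    inst_slots = inst_slots.reshape(K * num_inst, D)
    inst_mask = np.repeat(sem_mask, num_inst, axis=0)    # (K * num_inst, N)
    inst_slots, inst_attn = slot_attention(inst_slots, feats, mask=inst_mask)
    inst_label = inst_attn.argmax(axis=0)                # index = k * num_inst + m
    return sem_label, inst_label, inst_slots
```

In this sketch, every instance slot is restricted to the locations assigned to its parent semantic component, so integer-dividing an instance label by `num_inst` recovers the semantic label of that location.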

Learning Space-Time Semantic Correspondences

Spatial-then-Temporal Self-Supervised Learning for Video Correspondence

Temporal Tessellation: A Unified Approach for Video Analysis

Match me if you can: Semi-Supervised Semantic Correspondence Learning with Unpaired Images

Semantic-Aware Fine-Grained Correspondence

Joint-task Self-supervised Learning for Temporal Correspondence

Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos

Learning Semantic Correspondence Exploiting an Object-Level Prior

Boosting Video Object Segmentation Via Space-time Correspondence Learning

Rethinking Self-supervised Correspondence Learning: A Video Frame-level Similarity Perspective

Collaborative Spatio-temporal Feature Learning for Video Action Recognition

Improving Video Concept Detection Using Spatio-Temporal Correlation

Learning Fine-Grained Features for Pixel-wise Video Correspondences

End to End Alignment Learning of Instructional Videos with Spatiotemporal Hybrid Encoding and Decoding Space Reduction

Complementarity-Aware Space Learning for Video-Text Retrieval

Video Text Tracking With a Spatio-Temporal Complementary Model

Discriminative Spatiotemporal Alignment for Self-Supervised Video Correspondence Learning

Language-Aware Spatial-Temporal Collaboration for Referring Video Segmentation

A Novel Semantic Model for Video Concept Detection

On the Consensus of Synchronous Temporal and Spatial Views: A Novel Multimodal Deep Learning Method for Social Video Prediction

Semantic Correspondence as an Optimal Transport Problem