Abstract: Existing semi-supervised video object segmentation methods either focus on temporal feature matching or spatial-temporal feature modeling. However, they do not address the issues of sufficient target interaction and efficient parallel processing simultaneously, thereby constraining the learning of dynamic, target-aware features. To tackle these limitations, this paper proposes a spatial-temporal multi-level association framework, which jointly associates reference frame, test frame, and object features to achieve sufficient interaction and parallel target ID association with a spatial-temporal memory bank for efficient video object segmentation. Specifically, we construct a spatial-temporal multi-level feature association module to learn better target-aware features, which formulates feature extraction and interaction as the efficient operations of object self-attention, reference object enhancement, and test reference correlation. In addition, we propose a spatial-temporal memory to assist feature association and temporal ID assignment and correlation. We evaluate the proposed method by conducting extensive experiments on numerous video object segmentation datasets, including DAVIS 2016/2017 val, DAVIS 2017 test-dev, and YouTube-VOS 2018/2019 val. The favorable performance against the state-of-the-art methods demonstrates the effectiveness of our approach. All source code and trained models will be made publicly available.

What problem does this paper attempt to address?

The paper addresses limitations of existing semi-supervised Video Object Segmentation (VOS) methods. These methods typically focus on either temporal feature matching or spatial-temporal feature modeling, but they do not achieve sufficient target interaction and efficient parallel processing at the same time, which limits the learning of dynamic, target-aware features. To address this, the paper proposes a Spatial-Temporal Multi-level Association (STMA) framework. It jointly associates reference-frame, test-frame, and object features to achieve sufficient target interaction and parallel target ID association, and it uses a spatial-temporal memory bank for efficient video object segmentation. Specifically, the paper contributes:

1. **Spatial-Temporal Multi-level Feature Association Module**: This module is built to learn better target-aware features. It formulates feature extraction and interaction as three efficient operations: object self-attention, reference object enhancement, and test-reference correlation.
2. **Spatial-Temporal Memory Bank**: The memory bank assists feature association as well as temporal ID assignment and correlation. It retains information about the different objects from previous frames, which is used to match, isolate, and enhance the features of each object in the test frame.
3. **Experimental Validation**: Extensive experiments were conducted on public video object segmentation datasets, including DAVIS 2016/2017 val, DAVIS 2017 test-dev, and YouTube-VOS 2018/2019 val. The results compare favorably with state-of-the-art methods, especially in challenging scenarios with small objects or long sequences.

In summary, the paper's main contribution is a new spatial-temporal multi-level association method that addresses the limitations of existing video object segmentation methods, with particular strengths in handling small objects and long-term changes.
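To make the idea of matching test-frame features against a memory of previous frames more concrete, here is a minimal sketch of a generic attention-based memory readout. It is not the paper's STMA implementation: the function names (`softmax`, `read_memory`), the toy feature vectors, the one-hot ID values, and the choice of scaled dot-product attention are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read_memory(query, memory_keys, memory_values):
    """Attention readout for a single test-frame location (hypothetical helper).

    query:         feature vector of one test-frame location
    memory_keys:   one key vector per stored memory location
    memory_values: one value vector per stored memory location
    Returns the attention-weighted sum of the values and the weights.
    """
    scale = math.sqrt(len(query))
    # Similarity between the query and every memory key.
    scores = [
        sum(q * k for q, k in zip(query, key)) / scale
        for key in memory_keys
    ]
    weights = softmax(scores)
    dim = len(memory_values[0])
    # Weighted sum of the memory values, one output dimension at a time.
    readout = [
        sum(w * value[d] for w, value in zip(weights, memory_values))
        for d in range(dim)
    ]
    return readout, weights


# Toy memory (assumed data): two stored locations, one per object.
# The values are one-hot object IDs, so the readout is a soft ID vote.
memory_keys = [[1.0, 0.0], [0.0, 1.0]]
memory_values = [[1.0, 0.0], [0.0, 1.0]]

# A test-frame feature that resembles the first stored key.
query = [4.0, 0.0]

readout, weights = read_memory(query, memory_keys, memory_values)
predicted_id = max(range(len(readout)), key=readout.__getitem__)
print("attention weights:", weights)
print("predicted object ID:", predicted_id)
```

In this toy setup every test-frame location can be processed independently of the others, which is the property that makes attention-style readouts easy to parallelize. A real model would use learned feature maps and batched tensor operations rather than Python lists.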

Spatial-Temporal Multi-level Association for Video Object Segmentation

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

Multilevel Spatial-Temporal Feature Aggregation for Video Object Detection

Target-Aware Object Discovery and Association for Unsupervised Video Multi-Object Segmentation

Learning Quality-aware Dynamic Memory for Video Object Segmentation

Learning Spatial-Semantic Features for Robust Video Object Segmentation

Self-supervised Video Object Segmentation Using Integration-Augmented Attention

Learning Position and Target Consistency for Memory-based Video Object Segmentation

Dual Temporal Memory Network for Efficient Video Object Segmentation

Target Aware Adaptive Tracking for Unsupervised Video Object Segmentation

STC: Spatio-Temporal Contrastive Learning for Video Instance Segmentation

Dual temporal memory network with high-order spatio-temporal graph learning for video object segmentation

Spatio-Temporal Video Segmentation of Static Scenes and Its Applications

Object-based spatial similarity for semi-supervised video object segmentation

Video Object Segmentation with Dynamic Memory Networks and Adaptive Object Alignment

Beyond Appearance: Multi-Frame Spatio-Temporal Context Memory Networks for Efficient and Robust Video Object Segmentation

Video Object Segmentation Based on Multi-Level Target Models and Feature Integration

Temporo-Spatial Parallel Sparse Memory Networks for Efficient Video Object Segmentation

Joint Modeling of Feature, Correspondence, and a Compressed Memory for Video Object Segmentation

Video Object Segmentation using Space-Time Memory Networks

Video Object Segmentation with Weakly Temporal Information