Abstract: Tracking and segmenting multiple similar objects with complex or separate parts in long-term videos is inherently challenging due to the ambiguity of target parts and identity confusion caused by occlusion, background clutter, and long-term variations. In this paper, we propose a robust video object segmentation framework equipped with spatial-semantic features and discriminative object queries to address these issues. Specifically, we construct a spatial-semantic network comprising a semantic embedding block and a spatial dependency modeling block to associate the pretrained ViT features with global semantic features and local spatial features, providing a comprehensive target representation. In addition, we develop a masked cross-attention module to generate object queries that focus on the most discriminative parts of target objects during query propagation, alleviating noise accumulation and ensuring effective long-term query propagation. The experimental results show that the proposed method sets a new state of the art on multiple datasets, including the DAVIS 2017 test (89.1%), YouTubeVOS 2019 (88.5%), MOSE (75.1%), LVOS test (73.0%), and LVOS val (75.1%), demonstrating the effectiveness and generalization capacity of the proposed method. We will make all source code and trained models publicly available.

What problem does this paper attempt to address?

This paper addresses several key challenges in video object segmentation (VOS), particularly for target objects with complex or separated parts. Specifically, it attempts to solve the following problems:

1. **Dramatic changes in target appearance**: In long-term videos, the target object may undergo significant appearance changes, which makes it difficult for existing VOS methods to maintain accurate tracking and segmentation.
2. **Occlusion and background clutter**: Occlusion and background clutter can lead to ambiguity of target parts and identity confusion, especially when multiple similar objects are present.
3. **Incomplete prediction of target parts**: Existing methods rely mainly on pixel-level correlations, which results in incomplete prediction masks for targets with complex structures or separated parts.
4. **Noise accumulation in query propagation**: Existing query-based methods often introduce noise and errors when updating target queries, leading to long-term tracking failures.

To address these problems, the authors propose a VOS framework that combines spatial-semantic features with discriminative object queries. The specific contributions are as follows:

- **Spatial-semantic feature learning**: A spatial-semantic block, containing a semantic embedding module and a spatial dependency modeling module, combines the global semantic information provided by the pretrained ViT model with the local spatial information provided by a multi-scale CNN to generate a comprehensive target representation.
- **Discriminative query mechanism**: A discriminative query propagation module attends to the most representative parts of the target object, thereby reducing noise accumulation during query propagation and improving robustness in long-term videos.
- **Extensive experimental verification**: Experiments were carried out on five benchmark datasets: DAVIS 2017, YouTubeVOS 2018, YouTubeVOS 2019, LVOS, and MOSE. The results show that the method achieves new state-of-the-art performance on all of these datasets.

Together, these contributions aim at more robust video object segmentation, particularly for objects with complex structures and long-term deformations.
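As a rough illustration of the masked cross-attention idea behind the discriminative query mechanism, the sketch below updates a single object query by attending only to feature locations selected by a mask. It is a minimal, single-head, pure-Python sketch and not the authors' implementation: the function name `masked_cross_attention`, the toy features, and the way the mask is supplied as an input are all assumptions, since this summary does not specify how the discriminative parts are selected.

```python
import math


def masked_cross_attention(query, keys, values, mask):
    """Update one object query by attending only to masked-in locations.

    query:  list of d floats (hypothetical object query)
    keys:   list of N key vectors, each with d floats
    values: list of N value vectors, each with dv floats
    mask:   list of N booleans; True keeps a location, False blocks it
    Returns (updated_query, attention_weights).
    """
    if not any(mask):
        raise ValueError("mask must keep at least one location")

    scale = 1.0 / math.sqrt(len(query))

    # Scaled dot-product scores; blocked locations get -inf so that
    # they receive exactly zero weight after the softmax.
    scores = [
        scale * sum(q * k for q, k in zip(query, key)) if keep else -math.inf
        for key, keep in zip(keys, mask)
    ]

    # Numerically stable softmax over the kept locations.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # The updated query is the attention-weighted sum of the values.
    dv = len(values[0])
    updated = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dv)]
    return updated, weights


# Toy example: the third location (e.g. a distractor) is masked out,
# so it cannot contaminate the propagated query.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
values = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
mask = [True, True, False]
updated, weights = masked_cross_attention(query, keys, values, mask)
```

In this toy example the masked-out location would otherwise dominate the attention because its key has the largest dot product with the query; blocking it keeps the updated query a mixture of the two target locations only, which is the intuition behind limiting noise accumulation during propagation.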

Learning Spatial-Semantic Features for Robust Video Object Segmentation

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

Multilevel Spatial-Temporal Feature Aggregation for Video Object Detection

LiDAR Video Object Segmentation with Dynamic Kernel Refinement

Self-supervised Video Object Segmentation Using Integration-Augmented Attention

Spatial-Temporal Multi-level Association for Video Object Segmentation

Robust Video Object Segmentation with Restricted Attention

STC: Spatio-Temporal Contrastive Learning for Video Instance Segmentation

Scalable Video Object Segmentation with Identification Mechanism

Target-Aware Object Discovery and Association for Unsupervised Video Multi-Object Segmentation

Dual Temporal Memory Network for Efficient Video Object Segmentation

Towards Robust Video Object Segmentation with Adaptive Object Calibration

Motion-Guided Spatial Time Attention for Video Object Segmentation

YouTube-VOS: Sequence-to-Sequence Video Object Segmentation

Video object segmentation via couple streams and feature memory

Scalable Video Object Segmentation with Simplified Framework

Target Aware Adaptive Tracking for Unsupervised Video Object Segmentation

Towards Open-Vocabulary Video Semantic Segmentation

Discriminative Spatial-Semantic VOS Solution: 1st Place Solution for 6th LSVOS

Look Before You Match: Instance Understanding Matters in Video Object Segmentation

Training-Free Robust Interactive Video Object Segmentation