Abstract: Referring Video Object Segmentation (R-VOS) methods face challenges in maintaining consistent object segmentation due to temporal context variability and the presence of other visually similar objects. We propose an end-to-end R-VOS paradigm that explicitly models temporal instance consistency alongside the referring segmentation. Specifically, we introduce a novel hybrid memory that facilitates inter-frame collaboration for robust spatio-temporal matching and propagation. Features of frames with automatically generated high-quality reference masks are propagated to segment the remaining frames based on multi-granularity association to achieve temporally consistent R-VOS. Furthermore, we propose a new Mask Consistency Score (MCS) metric to evaluate the temporal consistency of video segmentation. Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin, leading to top-ranked performance on popular R-VOS benchmarks, i.e., Ref-YouTube-VOS (67.1%) and Ref-DAVIS17 (65.6%). The code is available at https://github.com/bo-miao/HTR.

What problem does this paper attempt to address?

This paper addresses how to keep object segmentation temporally consistent in the Referring Video Object Segmentation (R-VOS) task. R-VOS methods struggle to segment the target consistently when the temporal context changes or when other visually similar objects are present. The authors therefore propose an end-to-end R-VOS paradigm that explicitly models temporal instance consistency.

### Specific manifestations of the problem

1. **Changes in temporal context**: The appearance of objects in the video may change due to factors such as background, lighting, and occlusion, resulting in unstable segmentation results.
2. **Interference from visually similar objects**: When the video contains multiple objects with similar appearances, the model is prone to confusing the target object with the others, leading to segmentation errors.

### Solutions

To solve the above problems, the authors propose the following innovations:

1. **Hybrid Memory Module**:
   - A new hybrid memory module combines local memory and global tokens to achieve robust spatio-temporal matching and propagation.
   - The local memory captures pixel-level context, ensuring fine-grained feature matching and propagation.
   - The global tokens extract lightweight representations of the foreground and background, providing global spatio-temporal context that helps locate the target and reduces error propagation.
2. **Inter-frame Collaboration**:
   - Features of frames with high-quality reference masks are propagated to the remaining frames through multi-granularity association, thereby achieving temporally consistent R-VOS.
3. **Mask Consistency Score (MCS)**:
   - A new evaluation metric, the Mask Consistency Score (MCS), is proposed to evaluate the temporal consistency of video segmentation.
   - MCS measures the consistency of an entire video by checking whether the segmentation accuracy of all of its frames exceeds a specified threshold.

### Experimental results

Experiments show that the HTR model proposed by the authors achieves significantly better performance than existing methods on multiple popular R-VOS benchmark datasets, with an especially large improvement in temporal consistency. For example, on the Ref-YouTube-VOS and Ref-DAVIS17 datasets, the HTR model reaches J&F scores of 67.1% and 65.6%, respectively, and also improves significantly on MCS@0.9.

### Summary

The main contribution of this paper is an end-to-end R-VOS paradigm. By introducing the hybrid memory module and the inter-frame collaboration mechanism, it improves temporal consistency, and it proposes a new evaluation metric to measure the temporal consistency of video segmentation.
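The MCS description above can be made concrete with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: it assumes per-frame accuracy is mask IoU, that a video counts as consistent only if every frame's IoU strictly exceeds the threshold, and that the score is the fraction of consistent videos. The function names (`frame_iou`, `video_is_consistent`, `mask_consistency_score`) are hypothetical.

```python
import numpy as np


def frame_iou(pred, gt):
    """Intersection-over-union of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


def video_is_consistent(pred_masks, gt_masks, threshold):
    """True only if every frame's IoU strictly exceeds the threshold."""
    return all(
        frame_iou(p, g) > threshold for p, g in zip(pred_masks, gt_masks)
    )


def mask_consistency_score(videos, threshold):
    """Fraction of videos whose frames all pass the IoU threshold.

    `videos` is a list of (pred_masks, gt_masks) pairs, one pair per video.
    """
    if not videos:
        return 0.0
    hits = sum(video_is_consistent(p, g, threshold) for p, g in videos)
    return hits / len(videos)
```

Under these assumptions, one poorly segmented frame is enough to make a whole video fail at a given threshold, which is why a score such as MCS@0.9 is stricter than a frame-averaged measure like J&F.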

Temporally Consistent Referring Video Object Segmentation with Hybrid Memory
