Temporally Consistent Referring Video Object Segmentation with Hybrid Memory

Bo Miao, Mohammed Bennamoun, Yongsheng Gao, Mubarak Shah, Ajmal Mian
2024-10-11
Abstract: Referring Video Object Segmentation (R-VOS) methods face challenges in maintaining consistent object segmentation due to temporal context variability and the presence of other visually similar objects. We propose an end-to-end R-VOS paradigm that explicitly models temporal instance consistency alongside the referring segmentation. Specifically, we introduce a novel hybrid memory that facilitates inter-frame collaboration for robust spatio-temporal matching and propagation. Features of frames with automatically generated high-quality reference masks are propagated to segment the remaining frames based on multi-granularity association to achieve temporally consistent R-VOS. Furthermore, we propose a new Mask Consistency Score (MCS) metric to evaluate the temporal consistency of video segmentation. Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin, leading to top-ranked performance on popular R-VOS benchmarks, i.e., Ref-YouTube-VOS (67.1%) and Ref-DAVIS17 (65.6%). The code is available at https://github.com/bo-miao/HTR.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem that this paper attempts to solve is how to maintain the temporal consistency of object segmentation in the Referring Video Object Segmentation (R-VOS) task. Specifically, R-VOS methods have difficulty maintaining consistent object segmentation when facing changes in temporal context and the presence of other visually similar objects. Therefore, the authors propose an end-to-end R-VOS paradigm to solve these problems by explicitly modeling temporal instance consistency.

### Specific manifestations of the problem

1. **Changes in temporal context**: Objects in the video may change due to factors such as background, lighting, and occlusion, resulting in unstable segmentation results.
2. **Interference from visually similar objects**: When there are multiple objects with similar appearances in the video, the model is prone to confusing the target object, leading to segmentation errors.

### Solutions

To solve the above problems, the authors propose the following innovations:

1. **Hybrid Memory Module**:
   - A new hybrid memory module is introduced, which combines local memory and global tokens to achieve robust spatio-temporal matching and propagation.
   - The local memory is used to capture pixel-level context, ensuring fine-grained feature matching and propagation.
   - The global tokens extract lightweight representations of the foreground and background, providing crucial global spatio-temporal context, helping to locate the target and reduce error propagation.
2. **Inter-frame Collaboration**:
   - Utilize the features of high-quality reference frames and propagate these features to the remaining frames through multi-granularity association, thereby achieving temporally consistent R-VOS.
3. **Mask Consistency Score (MCS)**:
   - A new evaluation metric, the Mask Consistency Score (MCS), is proposed to evaluate the temporal consistency of video segmentation.
   - MCS measures the consistency of an entire video by checking whether the segmentation accuracy of every frame exceeds a specified threshold.

### Experimental results

Experiments show that the HTR model proposed by the authors outperforms existing methods on multiple popular R-VOS benchmark datasets, with a particularly large improvement in temporal consistency. For example, on the Ref-YouTube-VOS and Ref-DAVIS17 datasets, the HTR model reaches J&F scores of 67.1% and 65.6%, respectively, and also shows a significant improvement in MCS@0.9.

### Summary

The main contribution of this paper is an end-to-end R-VOS paradigm. By introducing the hybrid memory module and the inter-frame collaboration mechanism, it addresses the temporal consistency problem, and it proposes a new evaluation metric to measure the temporal consistency of video segmentation.
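
The MCS description above can be made concrete with a small sketch. It is an illustration under stated assumptions, not the authors' implementation: it assumes that per-frame "segmentation accuracy" is the mask IoU, that a video counts as consistent only if every frame strictly exceeds the threshold, and that the score is the fraction of consistent videos. The function names `frame_iou`, `video_is_consistent`, and `mask_consistency_score` are hypothetical.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # 2D binary mask, 1 = object pixel
Video = Tuple[List[Mask], List[Mask]]  # (predicted masks, ground-truth masks)


def frame_iou(pred: Mask, gt: Mask) -> float:
    """Intersection-over-union of two binary masks of the same shape."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumption: two empty masks are treated as a perfect match.
    return 1.0 if union == 0 else inter / union


def video_is_consistent(preds: List[Mask], gts: List[Mask], threshold: float) -> bool:
    """True only if every frame's IoU strictly exceeds the threshold."""
    return all(frame_iou(p, g) > threshold for p, g in zip(preds, gts))


def mask_consistency_score(videos: List[Video], threshold: float = 0.9) -> float:
    """Fraction of videos whose frames all pass the threshold (assumed aggregation)."""
    if not videos:
        return 0.0
    hits = sum(video_is_consistent(preds, gts, threshold) for preds, gts in videos)
    return hits / len(videos)
```

Under these assumptions, a single poorly segmented frame is enough to mark the whole video as inconsistent. That is what distinguishes this kind of score from a per-frame average such as J&F, which can stay high even when a few frames fail.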