Abstract: Transformer-based models have recently dominated the Referring Video Object Segmentation (RVOS) task owing to their superior performance. Most prior works adopt a unified DETR framework to generate segmentation masks in a query-to-instance manner. In this work, we integrate the strengths of these leading RVOS models to build an effective paradigm. We first obtain binary mask sequences from the RVOS models. To improve the consistency and quality of the masks, we propose a Two-Stage Multi-Model Fusion strategy. Each stage ensembles RVOS models based on their framework design and training strategy, and leverages different video object segmentation (VOS) models to enhance mask coherence through an object propagation mechanism. Our method achieves 75.7% J&F on the Ref-Youtube-VOS validation set and 70% J&F on the test set, ranking 1st in Track 3 of the 5th Large-scale Video Object Segmentation Challenge (ICCV 2023). Code is available at

What problem does this paper attempt to address?

This paper aims to address the key challenges in the **Referring Video Object Segmentation (RVOS)** task. Specifically, the goal of RVOS is to segment and track target objects in a video according to a given natural-language description. The main difficulties in this task lie in pixel-level alignment between different modalities (visual and text) and between time steps, especially due to the diversity of video content and the unrestricted nature of language expressions.

To solve these problems, the authors propose a **Two-Stage Multi-Model Fusion strategy**, which generates a high-quality binary mask sequence by integrating multiple current leading RVOS models and enhances the consistency and quality of the masks through different VOS models. The specific methods are as follows:

1. **First stage**:
   - Use multiple RVOS models (such as SOC, MUTR, Referformer, and UNINEXT) to generate an initial binary mask sequence.
   - Employ the AOT model as a post-processing step to improve the quality of the masks through an object propagation mechanism.
2. **Second stage**:
   - Utilize the DeAOT model to further enhance the quality of the masks, especially when dealing with long-term video sequences, to avoid information loss.
   - Fuse the high-quality masks generated in the first stage with the output of the UNINEXT model to fully utilize the advantages of different models.

Through the above methods, the authors achieved a J&F score of 75.7% on the Ref-Youtube-VOS validation set and a J&F score of 70% on the test set, and finally ranked first in the RVOS track of the 5th Large-Scale Video Object Segmentation Challenge (ICCV 2023).

### Formula summary

- **Mask generation**:
  \[ S_n = F_n(I, E)\quad n\in\{\text{soc}, \text{mutr}, \text{ref}, \text{uninext}\} \]
  where \( F_n \) represents the corresponding backbone model.
- **Post-processing**:
  \[ K_{\text{index}}=\arg\max(P) \]
  \[ M_n = G(\{s_n^i\}_{i = 0}^{K_{\text{index}}}, \{s_n^j\}_{j = K_{\text{index}}}^T) \]
  where \( G \) represents the VOS model used for post-processing.
- **Two-stage multi-model fusion**:
  \[ \hat{y}=\sum_{n = 1}^N\sum_{q = 1}^Q M_n^q \]
  \[ \hat{y}_i'= \begin{cases} 0 &\text{if }\hat{y}_i < \text{thr}\\ 1 &\text{if }\hat{y}_i\geq\text{thr} \end{cases} \]
  where \( Q \) represents the number of different text descriptions for the same object, \( N \) represents the number of models, \( i\in\{1, 2,\ldots, HW\}\), and \( H \) and \( W \) represent the height and width of the mask respectively.

Through these methods, the authors successfully addressed the key challenges in the RVOS task and achieved significant performance improvements.
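
The post-processing and fusion formulas above can be illustrated with a small Python sketch. It is not the authors' code. It assumes that \( P \) is a list of per-frame confidence scores (the summary does not define \( P \)), that each mask is flattened to a list of \( HW \) binary values, and that the \( N \times Q \) masks have already been collected into one list. The function names `select_key_frame`, `split_at_key_frame`, and `fuse_masks` are hypothetical, and the VOS model \( G \) itself is not modeled.

```python
from typing import List, Sequence, Tuple

Mask = List[int]  # flattened binary mask with H*W entries (assumed layout)


def select_key_frame(scores: Sequence[float]) -> int:
    """K_index = argmax(P): index of the highest-scoring frame.

    Assumes P is one confidence score per frame; ties go to the earliest frame.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    return max(range(len(scores)), key=lambda t: scores[t])


def split_at_key_frame(
    frames: Sequence[Mask], k_index: int
) -> Tuple[List[Mask], List[Mask]]:
    """Build the two clips handed to G: frames 0..K_index and K_index..T.

    Both clips contain the key frame, matching the index ranges in the formula.
    """
    if not 0 <= k_index < len(frames):
        raise IndexError("k_index is outside the frame range")
    return list(frames[: k_index + 1]), list(frames[k_index:])


def fuse_masks(masks: Sequence[Mask], thr: float) -> Mask:
    """Sum the masks pixel by pixel, then binarize with the threshold thr.

    masks holds every M_n^q for one frame (all models n, all expressions q).
    """
    if not masks:
        raise ValueError("masks must not be empty")
    n_pixels = len(masks[0])
    if any(len(m) != n_pixels for m in masks):
        raise ValueError("all masks must have the same number of pixels")
    votes = [sum(pixel_values) for pixel_values in zip(*masks)]  # y_hat
    return [1 if v >= thr else 0 for v in votes]  # y_hat'


# Three masks of four pixels each; a pixel is kept when at least two agree.
example = fuse_masks([[1, 0, 1, 0], [1, 1, 0, 0], [1, 1, 0, 0]], thr=2)
# example == [1, 1, 0, 0]
```

With `thr` set to just over half the number of masks, the fusion step behaves like a per-pixel majority vote; the threshold actually used by the authors is not given in this summary.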

1st Place Solution for 5th LSVOS Challenge: Referring Video Object Segmentation
