1st Place Solution for 5th LSVOS Challenge: Referring Video Object Segmentation

Zhuoyan Luo, Yicheng Xiao, Yong Liu, Yitong Wang, Yansong Tang, Xiu Li, Yujiu Yang
2024-01-01
Abstract: Recent transformer-based models have dominated the Referring Video Object Segmentation (RVOS) task thanks to their superior performance. Most prior works adopt a unified DETR framework to generate segmentation masks in a query-to-instance manner. In this work, we integrate the strengths of these leading RVOS models to build an effective paradigm. We first obtain binary mask sequences from the RVOS models. To improve the consistency and quality of the masks, we propose a Two-Stage Multi-Model Fusion strategy. Each stage rationally ensembles RVOS models based on framework design as well as training strategy, and leverages different video object segmentation (VOS) models to enhance mask coherence through an object propagation mechanism. Our method achieves 75.7% J&F on the Ref-Youtube-VOS validation set and 70% J&F on the test set, ranking 1st in track 3 of the 5th Large-scale Video Object Segmentation Challenge (ICCV 2023). Code is available at
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the key challenges of the **Referring Video Object Segmentation (RVOS)** task. The goal of RVOS is to segment and track target objects in a video according to a given natural-language description. The main difficulties lie in pixel-level alignment across modalities (visual and text) and across time steps, made harder by the diversity of video content and the unrestricted nature of language expressions.

To solve these problems, the authors propose a **Two-Stage Multi-Model Fusion strategy**, which generates a high-quality binary mask sequence by integrating multiple leading RVOS models and enhances the consistency and quality of the masks through different VOS models. The specific methods are as follows:

1. **First stage**:
   - Use multiple RVOS models (such as SOC, MUTR, ReferFormer, and UNINEXT) to generate an initial binary mask sequence.
   - Employ the AOT model as a post-processing step to improve the quality of the masks through an object propagation mechanism.
2. **Second stage**:
   - Utilize the DeAOT model to further enhance the quality of the masks, especially when dealing with long-term video sequences, to avoid information loss.
   - Fuse the high-quality masks generated in the first stage with the output of the UNINEXT model to fully utilize the advantages of different models.

With these methods, the authors achieved a J&F score of 75.7% on the Ref-Youtube-VOS validation set and 70% on the test set, and ranked first in the RVOS track of the 5th Large-Scale Video Object Segmentation Challenge (ICCV 2023).

### Formula summary

- **Mask generation**:
  \[ S_n = F_n(I, E)\quad n\in\{\text{soc}, \text{mutr}, \text{ref}, \text{uninext}\} \]
  where \( F_n \) represents the corresponding backbone model.
- **Post-processing**:
  \[ K_{\text{index}}=\arg\max(P) \]
  \[ M_n = G(\{s_n^i\}_{i = 0}^{K_{\text{index}}}, \{s_n^j\}_{j = K_{\text{index}}}^T) \]
  where \( G \) represents the VOS model used for post-processing.
- **Two-stage multi-model fusion**:
  \[ \hat{y}=\sum_{n = 1}^N\sum_{q = 1}^Q M_n^q \]
  \[ \hat{y}_i'= \begin{cases} 0 &\text{if }\hat{y}_i < \text{thr}\\ 1 &\text{if }\hat{y}_i\geq\text{thr} \end{cases} \]
  where \( Q \) represents the number of different text descriptions for the same object, \( N \) represents the number of models, \( i\in\{1, 2,\ldots, HW\}\), and \( H \) and \( W \) represent the height and width of the mask respectively.

Through these methods, the authors successfully addressed the key challenges in the RVOS task and achieved significant performance improvements.
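
The key-frame selection and fusion formulas above can be illustrated with a minimal sketch. This is not the authors' code. It assumes that \( P \) holds one confidence score per frame (the summary does not define \( P \)) and that each mask \( M_n^q \) is a binary mask flattened to length \( HW \). The names `select_key_frame` and `fuse_masks` are hypothetical, and the propagation step \( G \) (AOT or DeAOT) is omitted.

```python
from typing import List, Sequence

# Flattened binary mask of length H*W, matching the index i in {1, ..., HW}.
Mask = List[int]


def select_key_frame(scores: Sequence[float]) -> int:
    """Return K_index = argmax(P).

    `scores` stands in for P; treating it as per-frame confidences
    is an assumption, since the summary does not define P.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    return max(range(len(scores)), key=lambda i: scores[i])


def fuse_masks(masks: Sequence[Mask], thr: int) -> Mask:
    """Sum binary masks pixel-wise (y_hat), then threshold the vote count.

    `masks` is the collection of all M_n^q, i.e. N models times Q expressions.
    A pixel is kept (1) when at least `thr` masks mark it as foreground.
    """
    if not masks:
        raise ValueError("need at least one mask")
    size = len(masks[0])
    if any(len(m) != size for m in masks):
        raise ValueError("all masks must have the same length")
    votes = [sum(pixel) for pixel in zip(*masks)]
    return [1 if v >= thr else 0 for v in votes]


# Toy example: N = 3 hypothetical models, Q = 1 expression,
# and a 2x2 mask flattened to 4 pixels.
masks = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
]
fused = fuse_masks(masks, thr=2)          # per-pixel votes are [3, 2, 0, 1]
key = select_key_frame([0.2, 0.9, 0.5])   # frame with the highest score
```

With `thr=2` the fused mask keeps only pixels that at least two of the three masks agree on. Raising `thr` makes the ensemble more conservative, and lowering it approaches a pixel-wise union.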