Referring Video Object Segmentation via Language-aligned Track Selection

Seongchan Kim, Woojeong Jin, Sangbeom Lim, Heeji Yoon, Hyunwook Choi, Seungryong Kim
2024-12-02
Abstract: Referring Video Object Segmentation (RVOS) seeks to segment objects throughout a video based on natural language expressions. While existing methods have made strides in vision-language alignment, they often overlook the importance of robust video object tracking, where inconsistent mask tracks can disrupt vision-language alignment, leading to suboptimal performance. In this work, we present Selection by Object Language Alignment (SOLA), a novel framework that reformulates RVOS into two sub-problems, track generation and track selection. In track generation, we leverage a vision foundation model, Segment Anything Model 2 (SAM2), which generates consistent mask tracks across frames, producing reliable candidates for both foreground and background objects. For track selection, we propose a lightweight yet effective selection module that aligns visual and textual features while modeling object appearance and motion within video sequences. This design enables precise motion modeling and vision-language alignment. Our approach achieves state-of-the-art performance on the challenging MeViS dataset and demonstrates superior results in zero-shot settings on the Ref-Youtube-VOS and Ref-DAVIS datasets. Furthermore, SOLA exhibits strong generalization and robustness in corrupted settings, such as those with added Gaussian noise or motion blur. Our project page is available at https://cvlab-kaist.github.io/SOLA
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses a gap in Referring Video Object Segmentation (RVOS), the task of segmenting objects throughout a video based on natural-language expressions: **existing methods have made progress in vision-language alignment but often overlook the importance of robust video object tracking**. Specifically, inconsistent mask tracks can disrupt vision-language alignment and degrade performance. To address this, the authors propose **Selection by Object Language Alignment (SOLA)**, a framework that reformulates RVOS as two sub-problems: track generation and track selection. This split lets SOLA obtain high-quality mask tracks and concentrate on vision-language alignment, thereby improving overall performance.

### Main Contributions

1. **Reformulate the RVOS task**: Decompose RVOS into two sub-problems, track generation and track selection, and ensure high-quality mask tracks by using Segment Anything Model 2 (SAM2).
2. **Introduce a lightweight language-aligned track selection module**: The module uses the visual and language representations obtained from the frozen model to perform motion modeling and vision-language alignment.
3. **Achieve strong results on multiple benchmark datasets**: SOLA achieves state-of-the-art performance on MeViS and superior zero-shot results on Ref-YouTube-VOS and Ref-DAVIS, and it remains robust under corruptions such as added Gaussian noise or motion blur.

### Method Overview

SOLA consists of two stages:

- **Track Generation**: Use SAM2 to generate consistent mask tracks, yielding reliable candidates for both foreground and background objects.
- **Track Selection**: Use a lightweight selection module that combines visual and text features to select the track that best matches the given language expression.
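
The two-stage split can be illustrated with a minimal sketch of the selection step alone. Everything here is an assumption made for illustration, not the authors' implementation: the function name `select_tracks`, the representation of a mask as a set of `(row, col)` pixels per frame, the fixed `threshold`, and the choice to merge every track that passes it. The candidate tracks and their alignment scores are taken as given inputs; in SOLA they would come from SAM2 and the selection module respectively.

```python
def select_tracks(tracks, scores, threshold=0.5):
    """Keep candidate tracks whose alignment score passes a threshold.

    tracks: list of candidate tracks; each track is a list holding one set of
            (row, col) pixels per frame (a toy stand-in for a binary mask).
    scores: one alignment score in [0, 1] per track (hypothetical values).
    Returns the indices of the kept tracks and their per-frame union.
    """
    num_frames = len(tracks[0]) if tracks else 0
    merged = [set() for _ in range(num_frames)]
    selected = []
    for idx, (track, score) in enumerate(zip(tracks, scores)):
        if score >= threshold:
            selected.append(idx)
            for frame, pixels in enumerate(track):
                merged[frame] |= pixels  # union of kept masks in this frame
    return selected, merged


# Toy example: three candidate tracks over two frames.
tracks = [
    [{(0, 0)}, {(0, 1)}],  # candidate 0 moves one pixel to the right
    [{(1, 1)}, {(1, 1)}],  # candidate 1 stays still
    [{(2, 0)}, {(2, 1)}],  # candidate 2 moves one pixel to the right
]
scores = [0.9, 0.2, 0.7]  # hypothetical alignment scores for one expression

selected, merged = select_tracks(tracks, scores)
```

Because the candidates are fixed before any language is considered, the same tracks can be re-scored for a different expression without re-running the tracker. Real masks would normally be arrays rather than pixel sets.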
### Experimental Results

Experiments show that SOLA significantly outperforms existing methods on multiple RVOS benchmark datasets, especially when dealing with complex scenes and noisy data.

### Formula Summary

Some of the key formulas involved in the paper are as follows:

- **Alignment Score Calculation**:
  \[ O_a, s_a = \text{TS}(O; T) \]
  where \(O_a\) is the aligned object label and \(s_a\) is the alignment score.
- **Weighted Sum Calculation**:
  \[ w_a = \text{softmax}(\text{Avg}(O' \otimes T)) \]
  \[ O_a = w_a \otimes O' \]
  \[ s_a = \sigma(\text{Avg}(O' \otimes T)) \]
- **Loss Function**:
  \[ L = \lambda_1 L_{\text{BCE}} + \lambda_2 L_{\text{align}} \]
  where \(L_{\text{BCE}}\) is the binary cross-entropy loss and \(L_{\text{align}}\) is the alignment loss.

Through these improvements, SOLA shows excellent performance and robustness in the RVOS task.
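
The formulas in the summary can be made concrete with a small sketch, under assumptions the summary does not confirm. It reads \(O'\) as one feature vector per frame of a single track, \(T\) as one feature vector per word, and \(\otimes\) as dot-product similarity. It also assumes that Avg averages over words for the weights and over frames as well for the score, and that the BCE term compares alignment scores with binary labels. The alignment loss is not defined in the summary, so it is passed in as a number. All function names and default weights are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def align(obj_tokens, text_tokens):
    """Return (O_a, s_a, w_a) for one track under the assumed reading."""
    # Avg(O' (x) T): mean similarity of each frame token to all word tokens.
    sims = [
        sum(dot(o, t) for t in text_tokens) / len(text_tokens)
        for o in obj_tokens
    ]
    weights = softmax(sims)  # w_a, one weight per frame
    dim = len(obj_tokens[0])
    # O_a = w_a (x) O': weighted sum of the frame tokens.
    pooled = [
        sum(w * o[d] for w, o in zip(weights, obj_tokens)) for d in range(dim)
    ]
    # s_a: sigmoid of the similarity, additionally averaged over frames
    # (assumption) so that each track gets one scalar score.
    score = sigmoid(sum(sims) / len(sims))
    return pooled, score, weights


def total_loss(scores, labels, align_loss, lam1=1.0, lam2=1.0, eps=1e-7):
    """L = lam1 * L_BCE + lam2 * L_align, with L_align given by the caller."""
    bce = 0.0
    for s, y in zip(scores, labels):
        s = min(max(s, eps), 1.0 - eps)  # clamp to avoid log(0)
        bce -= y * math.log(s) + (1 - y) * math.log(1.0 - s)
    bce /= len(scores)
    return lam1 * bce + lam2 * align_loss


# Toy example: two frame tokens and one word token, all 2-dimensional.
pooled, score, weights = align([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
```

In this toy case the first frame token matches the word token more closely, so it receives the larger weight and dominates the pooled vector.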