Abstract: Referring Video Object Segmentation (RVOS) seeks to segment objects throughout a video based on natural language expressions. While existing methods have made strides in vision-language alignment, they often overlook the importance of robust video object tracking, where inconsistent mask tracks can disrupt vision-language alignment, leading to suboptimal performance. In this work, we present Selection by Object Language Alignment (SOLA), a novel framework that reformulates RVOS into two sub-problems, track generation and track selection. In track generation, we leverage a vision foundation model, Segment Anything Model 2 (SAM2), which generates consistent mask tracks across frames, producing reliable candidates for both foreground and background objects. For track selection, we propose a light yet effective selection module that aligns visual and textual features while modeling object appearance and motion within video sequences. This design enables precise motion modeling and vision-language alignment. Our approach achieves state-of-the-art performance on the challenging MeViS dataset and demonstrates superior results in zero-shot settings on the Ref-Youtube-VOS and Ref-DAVIS datasets. Furthermore, SOLA exhibits strong generalization and robustness in corrupted settings, such as those with added Gaussian noise or motion blur. Our project page is available at https://cvlab-kaist.github.io/SOLA

What problem does this paper attempt to address?

The problem that this paper attempts to solve is that **in Referring Video Object Segmentation (RVOS) based on natural-language expressions, existing methods have made progress in visual-language alignment, but often overlook the importance of robust video object tracking**. Specifically, inconsistent mask trajectories may disrupt visual-language alignment, resulting in poor performance. To address this issue, the authors propose a new framework, **Selection by Object Language Alignment (SOLA)**, which redefines the RVOS task as two sub-problems: trajectory generation and trajectory selection. Through this method, SOLA ensures high-quality mask trajectories and focuses on visual-language alignment, thereby improving overall performance.

### Main Contributions

1. **Redefine the RVOS task**: Decompose the RVOS task into two sub-problems, trajectory generation and trajectory selection, and ensure high-quality mask trajectories by using Segment Anything Model 2 (SAM2).
2. **Introduce a lightweight language-aligned trajectory selection module**: This module effectively utilizes the visual and language representations obtained from the frozen model to achieve motion modeling and visual-language alignment.
3. **Achieve the best results on multiple benchmark datasets**: SOLA performs excellently on datasets such as MeViS, Ref-YouTube-VOS, and Ref-DAVIS, especially demonstrating strong generalization ability and robustness in the zero-shot setting.

### Method Overview

The main processes of SOLA include:

- **Trajectory Generation**: Use SAM2 to generate consistent mask trajectories to ensure reliable foreground and background object candidates.
- **Trajectory Selection**: Through a lightweight selection module, combine visual and text features to select the trajectory that best matches the given language expression.
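The two-stage split described above can be illustrated with a small sketch. It is not the authors' implementation: the `Track` container, the per-track `score` field, the 0.5 threshold, and the choice to merge all selected tracks by union are assumptions made for illustration, and masks are represented as plain sets of pixel coordinates rather than SAM2 outputs.

```python
from dataclasses import dataclass


@dataclass
class Track:
    """Hypothetical container for one candidate mask track."""
    track_id: int
    masks: list   # one set of (row, col) pixels per frame
    score: float  # assumed alignment score in [0, 1] for the expression


def select_tracks(tracks, threshold=0.5):
    """Keep the tracks whose alignment score reaches the (assumed) threshold."""
    return [t for t in tracks if t.score >= threshold]


def merge_masks(selected, num_frames):
    """Union the masks of the selected tracks, frame by frame."""
    merged = [set() for _ in range(num_frames)]
    for track in selected:
        for frame_idx, mask in enumerate(track.masks):
            merged[frame_idx] |= mask
    return merged


# Toy candidates standing in for the output of the track generation stage.
tracks = [
    Track(0, [{(0, 0)}, {(0, 1)}], score=0.9),
    Track(1, [{(5, 5)}, {(5, 6)}], score=0.2),
    Track(2, [{(1, 1)}, set()], score=0.7),
]
selected = select_tracks(tracks)
final_masks = merge_masks(selected, num_frames=2)
```

Thresholding lets an expression refer to more than one object; picking only the single highest-scoring track would be an equally plausible reading of the summary.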
### Experimental Results

Experiments show that SOLA significantly outperforms existing methods on multiple RVOS benchmark datasets, especially when dealing with complex scenes and noisy data.

### Formula Summary

Some of the key formulas involved in the paper are as follows:

- **Alignment Score Calculation**:
  \[ O_a, s_a = TS(O; T) \]
  where \(O_a\) is the aligned object label and \(s_a\) is the alignment score.
- **Weighted Sum Calculation**:
  \[ w_a = \text{softmax}(\text{Avg}(O' \otimes T)) \]
  \[ O_a = w_a \otimes O' \]
  \[ s_a = \sigma(\text{Avg}(O' \otimes T)) \]
- **Loss Function**:
  \[ L = \lambda_1 L_{\text{BCE}} + \lambda_2 L_{\text{align}} \]
  where \(L_{\text{BCE}}\) is the binary cross-entropy loss and \(L_{\text{align}}\) is the alignment loss.

Through these improvements, SOLA shows excellent performance and robustness in the RVOS task.
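A rough numerical reading of the weighted-sum formulas is sketched below in plain Python. The summary does not define the shapes or the exact meaning of \(\otimes\), so several things here are assumptions: \(O'\) is taken as a list of per-track feature vectors, \(T\) as a list of text-token vectors, \(O' \otimes T\) as pairwise dot products, \(\text{Avg}\) as a mean over text tokens, the softmax as running over tracks, and \(w_a \otimes O'\) as a weighted sum of track vectors. The function names are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def alignment(track_tokens, text_tokens):
    """Return (w_a, O_a, s_a) under the assumptions stated above."""
    # Avg(O' (x) T): mean similarity of each track to all text tokens.
    logits = [
        sum(dot(o, t) for t in text_tokens) / len(text_tokens)
        for o in track_tokens
    ]
    # w_a: numerically stable softmax over tracks.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # O_a: weighted sum of the track vectors.
    dim = len(track_tokens[0])
    aligned = [
        sum(w * o[d] for w, o in zip(weights, track_tokens))
        for d in range(dim)
    ]
    # s_a: sigmoid of the same averaged similarity, one score per track.
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    return weights, aligned, scores


def total_loss(l_bce, l_align, lambda_1=1.0, lambda_2=1.0):
    """L = lambda_1 * L_BCE + lambda_2 * L_align; default weights are placeholders."""
    return lambda_1 * l_bce + lambda_2 * l_align


# Two toy tracks and one toy text token.
weights, aligned, scores = alignment([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0]])
```

In this toy input the first track is the only one similar to the text token, so it receives the larger softmax weight and the higher sigmoid score.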

Referring Video Object Segmentation via Language-aligned Track Selection

LiDAR Video Object Segmentation with Dynamic Kernel Refinement

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation

Learning Referring Video Object Segmentation from Weak Annotation

Temporally Consistent Referring Video Object Segmentation with Hybrid Memory

Towards Robust Referring Video Object Segmentation with Cyclic Relational Consensus

Region Aware Video Object Segmentation With Deep Motion Modeling

The Second Place Solution for The 4th Large-scale Video Object Segmentation Challenge--Track 3: Referring Video Object Segmentation

The 2nd Solution for LSVOS Challenge RVOS Track: Spatial-temporal Refinement for Consistent Semantic Segmentation

Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation

UNINEXT-Cutie: The 1st Solution for LSVOS Challenge RVOS Track

RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation

Video Object Segmentation via SAM 2: The 4th Solution for LSVOS Challenge VOS Track

LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation

Fully Transformer-Equipped Architecture for End-to-End Referring Video Object Segmentation

MaskTrack: Auto-Labeling and Stable Tracking for Video Object Segmentation

TTVOS: Lightweight Video Object Segmentation with Adaptive Template Attention Module and Temporal Consistency Loss

Scalable Video Object Segmentation with Simplified Framework

Towards Robust Video Object Segmentation with Adaptive Object Calibration