Visual-guided Query with Temporal Interaction for Video Object Segmentation

Jiaxin Qiu, Guoyu Yang, Jie Lei, Zunlei Feng, Ronghua Liang
DOI: https://doi.org/10.1109/icme57554.2024.10687937
2024-01-01
Abstract: The task of referring video object segmentation (RVOS) involves segmenting objects in video frames based on a given text description. However, most existing approaches treat the text directly as a query, neglecting the valuable visual and temporal information from the video. This limitation may leave the query unable to perceive the target object accurately. To address this issue, we introduce VQTI, a visual-guided query with temporal interaction approach for referring video object segmentation. Our method uses frame-level and video-level features to guide the query generation process, resulting in an enhanced perception of the target object. In addition, we introduce a spectral-guided segmentation optimizer module to enhance the fine-grained information, leading to more precise segmentation masks. Extensive experiments show competitive performance against state-of-the-art approaches.