ClickVOS: Click Video Object Segmentation
Pinxue Guo, Lingyi Hong, Xinyu Zhou, Shuyong Gao, Wanyun Li, Jinglun Li, Zhaoyu Chen, Xiaoqiang Li, Wei Zhang, Wenqiang Zhang
DOI: https://doi.org/10.48550/arxiv.2403.06130
2024-01-01
Abstract: The Video Object Segmentation (VOS) task aims to segment objects in videos. However, previous settings either require time-consuming manual masks of target objects at the first frame during inference or lack the flexibility to specify arbitrary objects of interest. To address these limitations, we propose a setting named Click Video Object Segmentation (ClickVOS), which segments objects of interest across the whole video according to a single click per object in the first frame. We also provide the extended datasets DAVIS-P and YouTubeVOSP, with point annotations, to support this task. ClickVOS has significant practical and research value because indicating an object takes only 1-2 seconds of interaction, whereas annotating an object mask takes several minutes. However, ClickVOS also presents increased challenges. To address this task, we propose an end-to-end baseline approach named Attention Before Segmentation (ABS), motivated by the attention process of humans. ABS uses the given point in the first frame to perceive the target object through a concise yet effective segmentation attention. The initial object mask may be inaccurate, but in ABS the initially imprecise mask can self-heal as the video goes on instead of deteriorating due to error accumulation. This is attributed to our designed improvement memory, which continuously records stable global object memory and updates detailed dense memory. In addition, we conduct various baseline explorations using off-the-shelf algorithms from related fields, which could provide insights for further exploration of ClickVOS. The experimental results demonstrate the superiority of the proposed ABS approach. Extended datasets and code will be available at https://github.com/PinxueGuo/ClickVOS.
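
To make the ClickVOS input and output concrete, the following is a minimal sketch of the setting only: one click per object in the first frame goes in, and one mask per frame comes out. Every name here (`ClickVOSQuery`, `validate_query`, `placeholder_segment`, `radius`) is hypothetical and does not come from the paper or its repository. The segmenter is a deliberately trivial stand-in, not ABS, and the data layout is an assumption rather than the format of DAVIS-P or YouTubeVOSP.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[int, int]   # (row, col) of a click in the first frame
Mask = List[List[int]]    # per-pixel object id; 0 means background


@dataclass
class ClickVOSQuery:
    """Hypothetical container: video size plus one click per object."""
    num_frames: int
    height: int
    width: int
    clicks: Dict[int, Point]  # object id (>= 1) -> single first-frame click


def validate_query(query: ClickVOSQuery) -> None:
    """Check that every object has a valid id and an in-bounds click."""
    for obj_id, (row, col) in query.clicks.items():
        if obj_id < 1:
            raise ValueError("object ids must start at 1; 0 is background")
        if not (0 <= row < query.height and 0 <= col < query.width):
            raise ValueError(f"click for object {obj_id} is outside the frame")


def placeholder_segment(query: ClickVOSQuery, radius: int = 1) -> List[Mask]:
    """Stand-in for a real model (NOT ABS): paint a square around each click
    and repeat that mask for every frame, only to show the output shape."""
    validate_query(query)
    first = [[0] * query.width for _ in range(query.height)]
    for obj_id, (row, col) in query.clicks.items():
        for r in range(max(0, row - radius), min(query.height, row + radius + 1)):
            for c in range(max(0, col - radius), min(query.width, col + radius + 1)):
                first[r][c] = obj_id
    # One independent copy of the first-frame mask per frame.
    return [[line[:] for line in first] for _ in range(query.num_frames)]


# Two objects, each indicated by a single click in frame 0.
query = ClickVOSQuery(num_frames=3, height=6, width=8, clicks={1: (1, 1), 2: (4, 6)})
masks = placeholder_segment(query)
print(len(masks), masks[0][1][1], masks[0][4][6])
```

A real method would replace `placeholder_segment` with a model that infers the object extent from the click and tracks it through the video; the sketch fixes only the contract that the setting implies, a handful of points in and a full mask sequence out.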