Abstract: The Video Object Segmentation (VOS) task aims to segment objects in videos. However, previous settings either require time-consuming manual masks of target objects at the first frame during inference or lack the flexibility to specify arbitrary objects of interest. To address these limitations, we propose a setting named Click Video Object Segmentation (ClickVOS), which segments objects of interest across the whole video according to a single click per object in the first frame. We also provide the extended datasets DAVIS-P and YouTubeVOSP, which add point annotations to support this task. ClickVOS has significant practical applications and research implications because indicating an object takes only 1-2 seconds of interaction time, whereas annotating the mask of an object takes several minutes. However, ClickVOS also presents increased challenges. To address this task, we propose an end-to-end baseline approach called Attention Before Segmentation (ABS), motivated by the attention process of humans. ABS utilizes the given point in the first frame to perceive the target object through a concise yet effective segmentation attention. Although the initial object mask may be inaccurate, in ABS the initially imprecise mask can self-heal as the video goes on instead of deteriorating due to error accumulation, which we attribute to our designed improvement memory that continuously records stable global object memory and updates detailed dense memory. In addition, we conduct various baseline explorations utilizing off-the-shelf algorithms from related fields, which could provide insights for further exploration of ClickVOS. The experimental results demonstrate the superiority of the proposed ABS approach. Extended datasets and code will be available at https://github.com/PinxueGuo/ClickVOS.

Text and Click inputs for unambiguous open vocabulary instance segmentation

FocalClick: Towards Practical Interactive Image Segmentation

Image Segmentation Using Text and Image Prompts

Text2Seg: Remote Sensing Image Semantic Segmentation via Text-Guided Visual Foundation Models

Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling

USE: Universal Segment Embeddings for Open-Vocabulary Image Segmentation

Unambiguous Scene Text Segmentation with Referring Expression Comprehension

PseudoClick: Interactive Image Segmentation with Click Imitation

WeClick: Weakly-Supervised Video Semantic Segmentation with Click Annotations

Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors

NuClick: A Deep Learning Framework for Interactive Segmentation of Microscopy Images

VideoClick: Video Object Segmentation with a Single Click

CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation

InvSeg: Test-Time Prompt Inversion for Semantic Segmentation

Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models

Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels

ClickVOS: Click Video Object Segmentation

Towards Training-free Open-world Segmentation via Image Prompt Foundation Models

Improving Referring Image Segmentation using Vision-Aware Text Features

Text4Seg: Reimagining Image Segmentation as Text Generation

Text Augmented Spatial-aware Zero-shot Referring Image Segmentation