Abstract: Camouflaged Object Segmentation (COS) faces significant challenges due to the scarcity of annotated data: meticulous pixel-level annotation is both labor-intensive and costly, primarily because of the intricate object-background boundaries. We address the core question, "Can COS be effectively achieved in a zero-shot manner without manual annotations for any camouflaged object?", answer it affirmatively, and introduce a robust zero-shot COS framework. This framework leverages the inherent local pattern bias of COS and employs a broad semantic feature space derived from salient object segmentation (SOS) for efficient zero-shot transfer. We incorporate a Masked Image Modeling (MIM) based image encoder optimized for Parameter-Efficient Fine-Tuning (PEFT), a Multimodal Large Language Model (M-LLM), and a Multi-scale Fine-grained Alignment (MFA) mechanism. The MIM pre-trained image encoder focuses on capturing essential low-level features, while the M-LLM generates caption embeddings that are processed alongside these visual cues. The visual features and caption embeddings are aligned using MFA, enabling our framework to interpret complex semantic contexts accurately. To reduce inference cost, we introduce a learnable codebook that stands in for the M-LLM during inference, significantly reducing computational overhead. In experiments, our framework achieves state-of-the-art performance in zero-shot COS, with $F_{\beta}^w$ scores of 72.9\% on CAMO and 71.7\% on COD10K. By removing the M-LLM during inference, we achieve an inference speed comparable to that of traditional end-to-end models, reaching 18.1 FPS. Code: https://github.com/R-LEI360725/ZSCOS-CaMF

Zero-Shot Video Object Segmentation with Co-Attention Siamese Networks

See More, Know More: Unsupervised Video Object Segmentation with Co-Attention Siamese Networks

Co-attention Propagation Network for Zero-Shot Video Object Segmentation

Towards Real Zero-Shot Camouflaged Object Segmentation without Camouflaged Annotations

CoNet: A Consistency-Oriented Network for Camouflaged Object Segmentation

Motion-Attentive Transition for Zero-Shot Video Object Segmentation

Channel and spatial attention based deep object co-segmentation

Multi-Similarity Enhancement Network for Few-Shot Segmentation

Group-wise Deep Object Co-Segmentation with Co-Attention Recurrent Neural Network

Zero-Shot Co-salient Object Detection Framework

Zero-Shot Video Object Segmentation via Attentive Graph Neural Networks

Semantic Aware Attention Based Deep Object Co-segmentation

Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion

Motion-Guided Spatial Time Attention for Video Object Segmentation

A Simple Framework for Open-Vocabulary Zero-Shot Segmentation

ConsNet: Learning Consistency Graph for Zero-Shot Human-Object Interaction Detection

Adversarial domain adaptation with Siamese network for video object cosegmentation

Camouflaged Object Segmentation with Omni Perception

Deep Object Co-segmentation via Spatial-Semantic Network Modulation

Delving into Shape-aware Zero-shot Semantic Segmentation

Toward Stable Co-Saliency Detection and Object Co-Segmentation