Abstract: The recent emergence of large-scale vision-language models (VLMs), such as CLIP, has opened the way towards open-world object perception. Many works have explored using pre-trained VLMs for the challenging open-vocabulary dense prediction task, which requires perceiving diverse objects of novel classes at inference time. Existing methods run their experiments on public datasets from related tasks, which are not tailored to the open-vocabulary setting and, owing to data collection bias and annotation costs, rarely contain imperceptible objects camouflaged in complex scenes. To fill this gap, we introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS), and construct a large-scale complex scene dataset (\textbf{OVCamo}) containing 11,483 hand-selected images with fine annotations and corresponding object classes. Further, we build a strong single-stage open-vocabulary \underline{c}amouflaged \underline{o}bject \underline{s}egmentation transform\underline{er} baseline, \textbf{OVCoser}, attached to a parameter-fixed CLIP with iterative semantic guidance and structure enhancement. By integrating the guidance of class semantic knowledge with supplementary visual structure cues from edge and depth information, the proposed method can efficiently capture camouflaged objects. Moreover, this framework also surpasses previous state-of-the-art open-vocabulary semantic image segmentation methods by a large margin on our OVCamo dataset. With the proposed dataset and baseline, we hope that this new task, which has more practical value, can further expand research on open-vocabulary dense prediction. Our code and data are available at this \href{https://github.com/lartpang/OVCamo}{link}.

Open-Vocabulary Scene Text Recognition Via Pseudo-Image Labeling and Margin Loss

Scene Text Detection and Recognition System for Visually Impaired People in Real World

Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models

Open-Vocabulary Object Detection using Pseudo Caption Labels

Open-Vocabulary Object Detection via Scene Graph Discovery

TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text

Towards open-set text recognition via label-to-prototype learning

OV-DQUO: Open-Vocabulary DETR with Denoising Text Query Training and Open-World Unknown Objects Supervision

Open Vocabulary Scene Parsing

Learning Pseudo-Labeler beyond Noun Concepts for Open-Vocabulary Object Detection

From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects

DENOISER: Rethinking the Robustness for Open-Vocabulary Action Recognition

Training an Open-Vocabulary Monocular 3D Object Detection Model without 3D Data

On Vocabulary Reliance in Scene Text Recognition

Open-Vocabulary Camouflaged Object Segmentation

Learning Open-vocabulary Semantic Segmentation Models from Natural Language Supervision

Open-vocabulary Panoptic Segmentation with Embedding Modulation

Open-Vocabulary Object Detection with an Open Corpus

Towards Open-Vocabulary Video Semantic Segmentation

ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting

Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering