A Simple Framework for Open-Vocabulary Zero-Shot Segmentation

Thomas Stegmüller, Tim Lebailly, Nikola Dukic, Behzad Bozorgtabar, Tinne Tuytelaars, Jean-Philippe Thiran

2024-07-01

Abstract: Zero-shot classification capabilities naturally arise in models trained within a vision-language contrastive framework. Despite their classification prowess, these models struggle in dense tasks like zero-shot open-vocabulary segmentation. This deficiency is often attributed to the absence of localization cues in captions and the intertwined nature of the learning process, which encompasses both image representation learning and cross-modality alignment. To tackle these issues, we propose SimZSS, a Simple framework for open-vocabulary Zero-Shot Segmentation. The method is founded on two key principles: i) leveraging frozen vision-only models that exhibit spatial awareness while exclusively aligning the text encoder and ii) exploiting the discrete nature of text and linguistic knowledge to pinpoint local concepts within captions. By capitalizing on the quality of the visual representations, our method requires only image-caption pairs datasets and adapts to both small curated and large-scale noisy datasets. When trained on COCO Captions across 8 GPUs, SimZSS achieves state-of-the-art results on 7 out of 8 benchmark datasets in less than 15 minutes.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses open-vocabulary zero-shot segmentation. Models trained with a vision-language contrastive objective perform well on zero-shot classification, but they perform poorly on dense tasks such as zero-shot semantic segmentation. This deficiency is commonly attributed to the absence of localization cues in captions and to the entanglement of image representation learning with cross-modal alignment. The paper proposes a simple framework, SimZSS, built on two key principles:

1. **Use frozen vision-only models**: These models already exhibit spatial awareness, so SimZSS aligns only the text encoder.
2. **Exploit the discrete nature of text and linguistic knowledge**: Local concepts are identified within the captions so that the corresponding concepts can be located in the image.

As a result, SimZSS requires only image-caption pair datasets, adapts to both small curated datasets and large-scale noisy ones, and achieves state-of-the-art results on 7 out of 8 benchmark datasets.
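The two principles can be illustrated with a toy sketch. This is not the authors' implementation: the concept bank, the hand-written embeddings, and the helper names `find_concepts` and `label_patches` are all hypothetical. Plain word matching stands in for the linguistic analysis mentioned in the abstract, and fixed vectors stand in for the outputs of a frozen vision encoder and an aligned text encoder.

```python
import math

# Hypothetical concept bank; the vocabulary actually used by SimZSS is not given here.
CONCEPT_BANK = {"dog", "frisbee", "grass"}


def find_concepts(caption, bank=CONCEPT_BANK):
    """Return the bank concepts that occur as words in the caption, in order."""
    words = [w.strip(".,;:!?").lower() for w in caption.split()]
    found = []
    for word in words:
        if word in bank and word not in found:
            found.append(word)
    return found


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def label_patches(patch_embeddings, concept_embeddings):
    """Assign each patch the concept whose text embedding is most similar to it."""
    names = list(concept_embeddings)
    labels = []
    for patch in patch_embeddings:
        sims = [cosine(patch, concept_embeddings[name]) for name in names]
        labels.append(names[sims.index(max(sims))])
    return labels


# Toy data: in practice these would come from a frozen vision encoder (patches)
# and a text encoder aligned to it (concepts). The values are made up.
caption = "A dog catches a frisbee on the grass."
text_embeddings = {"dog": [1.0, 0.0, 0.0], "frisbee": [0.0, 1.0, 0.0], "grass": [0.0, 0.0, 1.0]}
patches = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9], [0.1, 0.1, 0.7]]

concepts = find_concepts(caption)
labels = label_patches(patches, {c: text_embeddings[c] for c in concepts})
print(concepts)
print(labels)
```

Because the vision features are assumed to be fixed and already spatially meaningful, the only learned component in such a setup would be the text side, which is what makes per-patch nearest-concept labelling possible at inference time.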


CLIP Is Also a Good Teacher: A New Learning Framework for Inductive Zero-shot Semantic Segmentation

Transformer-Based Approach Via Contrastive Learning for Zero-Shot Detection

Weakly Supervised Classification Model for Zero‐shot Semantic Segmentation

Cap2Seg: Inferring Semantic and Spatial Context from Captions for Zero-Shot Image Segmentation

Exploring Open-Vocabulary Semantic Segmentation without Human Labels

A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model

Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training

Delving into Shape-aware Zero-shot Semantic Segmentation

[CLS] Token is All You Need for Zero-Shot Semantic Segmentation

Text Augmented Spatial-aware Zero-shot Referring Image Segmentation

Zero-shot Unsupervised Transfer Instance Segmentation

Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions

ZeroMamba: Exploring Visual State Space Model for Zero-Shot Learning

Towards Real Zero-Shot Camouflaged Object Segmentation without Camouflaged Annotations

Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding

Zero-Shot Scene Classification for High Spatial Resolution Remote Sensing Images

Zero-Shot Dual-Path Integration Framework for Open-Vocabulary 3D Instance Segmentation

See More and Know More: Zero-shot Point Cloud Segmentation via Multi-modal Visual Data

Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation

TransZero: Attribute-guided Transformer for Zero-Shot Learning