Abstract: Open-vocabulary semantic segmentation is a challenging task that requires segmenting novel object categories at inference time. Recent studies have explored vision-language pre-training to handle this task, but they rely on assumptions that are unrealistic in practical scenarios, where textual category names are often of low quality. For example, this paradigm assumes that new textual categories will be accurately and completely provided, and that they exist in the lexicons used during pre-training. However, exceptions often arise: brief or incomplete names are ambiguous, new words are absent from the pre-trained lexicons, and some categories are difficult for users to describe. To address these issues, this work proposes a novel attribute decomposition-aggregation framework, AttrSeg, inspired by human cognition in understanding new concepts. Specifically, in the decomposition stage, we decouple class names into diverse attribute descriptions to complement semantic contexts from multiple perspectives. Two attribute construction strategies are designed: using large language models for common categories, and manual labeling for human-invented categories. In the aggregation stage, we group diverse attributes into an integrated global description, to form a discriminative classifier that distinguishes the target object from others. A hierarchical aggregation architecture is further proposed to achieve multi-level aggregation, leveraging a carefully designed clustering module. The final results are obtained by computing the similarity between the aggregated attributes and image embeddings. To evaluate its effectiveness, we annotate three types of datasets with attribute descriptions, and conduct extensive experiments and ablation studies. The results show the superior performance of attribute decomposition-aggregation.
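
The aggregation and similarity steps described in the abstract can be illustrated with a minimal sketch. It is not the AttrSeg implementation: the learned hierarchical aggregation and clustering module are replaced here by plain mean pooling, the embeddings are hand-written toy vectors rather than outputs of a vision-language model, and the function names (`aggregate_attributes`, `segment`) are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def aggregate_attributes(attr_embeddings):
    """Pool several attribute embeddings into one class embedding.

    Assumption: simple mean pooling stands in for the paper's learned
    hierarchical aggregation, which is not reproduced here.
    """
    normed = [l2_normalize(a) for a in attr_embeddings]
    dim = len(normed[0])
    mean = [sum(a[i] for a in normed) / len(normed) for i in range(dim)]
    return l2_normalize(mean)

def segment(pixel_embeddings, class_attr_embeddings):
    """Label each pixel with the class whose aggregated embedding is most similar."""
    class_names = list(class_attr_embeddings)
    classifiers = [aggregate_attributes(class_attr_embeddings[c]) for c in class_names]
    labels = []
    for p in pixel_embeddings:
        p = l2_normalize(p)
        # Cosine similarity reduces to a dot product on unit vectors.
        sims = [sum(a * b for a, b in zip(p, w)) for w in classifiers]
        labels.append(class_names[sims.index(max(sims))])
    return labels

# Toy 3-D embeddings (illustrative values only, two attributes per class).
attrs = {
    "cat": [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]],
    "grass": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
}
pixels = [[0.8, 0.2, 0.0], [0.1, 0.7, 0.1]]
print(segment(pixels, attrs))  # → ['cat', 'grass']
```

In this sketch each class is represented only by its attribute embeddings, never by its name, which mirrors the idea that a category can be recognized from descriptions when its name is ambiguous or unseen.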

StructToken: Rethinking Semantic Segmentation with Structural Prior

Rethinking Semantic Segmentation: A Prototype View

Segformer++: Efficient Token-Merging Strategies for High-Resolution Semantic Segmentation

Token Sparsification for Faster Medical Image Segmentation

A Structural Method for Online Sketched Symbol Recognition

Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

SWAT: Spatial Structure Within and Among Tokens

Delving into Shape-aware Zero-shot Semantic Segmentation

MST: Adaptive Multi-Scale Tokens Guided Interactive Segmentation

[CLS] Token is All You Need for Zero-Shot Semantic Segmentation

Correlation Maximized Structural Similarity Loss for Semantic Segmentation

Prototype-based Semantic Segmentation

Rethinking Self-Supervised Semantic Segmentation: Achieving End-to-End Segmentation

Remote sensing image semantic segmentation via class-guided structural interaction and boundary perception

Semantic Segmentation via Structured Patch Prediction, Context CRF and Guidance CRF

AttrSeg: Open-Vocabulary Semantic Segmentation via Attribute Decomposition-Aggregation

TFRNet: Semantic Segmentation Network with Token Filtration and Refinement Method

SSA-Seg: Semantic and Spatial Adaptive Pixel-level Classifier for Semantic Segmentation

Semantic and Spatial Adaptive Pixel-level Classifier for Semantic Segmentation

Exploring Context with Deep Structured models for Semantic Segmentation