AttrSeg: Open-Vocabulary Semantic Segmentation via Attribute Decomposition-Aggregation

Chaofan Ma, Yuhuan Yang, Chen Ju, Fei Zhang, Ya Zhang, Yanfeng Wang
2024-01-06
Abstract: Open-vocabulary semantic segmentation is a challenging task that requires segmenting novel object categories at inference time. Recent studies have explored vision-language pre-training to handle this task, but they rely on assumptions that are unrealistic in practical scenarios, where textual category names are often of low quality. For example, this paradigm assumes that new textual categories will be accurately and completely provided, and exist in lexicons during pre-training. However, exceptions often arise: brief or incomplete names are ambiguous, new words may be absent from the pre-trained lexicons, and some categories are difficult for users to describe. To address these issues, this work proposes a novel attribute decomposition-aggregation framework, AttrSeg, inspired by human cognition in understanding new concepts. Specifically, in the decomposition stage, we decouple class names into diverse attribute descriptions to complement semantic contexts from multiple perspectives. Two attribute construction strategies are designed: using large language models for common categories, and manual labeling for human-invented categories. In the aggregation stage, we group diverse attributes into an integrated global description, to form a discriminative classifier that distinguishes the target object from others. A hierarchical aggregation architecture is further proposed to achieve multi-level aggregation, leveraging a carefully designed clustering module. The final results are obtained by computing the similarity between aggregated attributes and image embeddings. To evaluate the effectiveness, we annotate three types of datasets with attribute descriptions, and conduct extensive experiments and ablation studies. The results show the superior performance of attribute decomposition-aggregation.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper attempts to address the issue of low-quality new category names in practical applications of the Open-Vocabulary Semantic Segmentation (OVSS) task. Specifically, existing methods usually assume that new text category names are accurate, complete, and present in the pre-trained vocabulary. However, this assumption often does not hold in real-world scenarios, mainly due to the following three problems:

1. **Ambiguity**: Short or incomplete names can lead to lexical ambiguity, affecting semantic differentiation ability.
2. **New Words**: Newly emerged words may not be in the pre-trained vocabulary, causing the pre-trained language model to fail in interpreting their semantics.
3. **Difficult to Name**: Some categories may not have known or easily describable names, especially those involving technical terms, rare animal names, etc.

To address these issues, the authors propose a new framework based on Attribute Decomposition-Aggregation, called AttrSeg. The main contributions of this framework include:

- **Attribute Decomposition**: Decomposing category names into multiple attribute descriptions to supplement semantic context and improve semantic differentiation ability.
- **Attribute Aggregation**: Integrating multiple attribute descriptions into a global description through a multi-level aggregation architecture to form a discriminative classifier.
- **Dataset Construction**: Constructing detailed attribute descriptions for existing datasets (such as PASCAL and COCO) as well as newly collected datasets (such as Fantastic Beasts).

Through these methods, AttrSeg can better handle OVSS issues in practical applications and improve segmentation performance.
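
The decomposition-aggregation idea can be illustrated with a minimal sketch. It is not the authors' implementation: the learned hierarchical aggregation with its clustering module is replaced here by plain mean pooling, and the embeddings are hand-written toy vectors rather than outputs of a pre-trained vision-language encoder. The names `aggregate_attributes` and `segment` are hypothetical and chosen only for this illustration. What the sketch does show is the final step described in the abstract: each class is represented by one aggregated attribute embedding, and every pixel receives the class whose embedding is most similar to it.

```python
import math
from typing import Dict, List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def aggregate_attributes(attr_embeddings: List[Vector]) -> Vector:
    """Merge several attribute embeddings into one class embedding.

    ASSUMPTION: mean pooling is a stand-in for the paper's learned,
    hierarchical aggregation; it only illustrates the data flow.
    """
    dim = len(attr_embeddings[0])
    count = len(attr_embeddings)
    mean = [sum(e[i] for e in attr_embeddings) / count for i in range(dim)]
    return l2_normalize(mean)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def segment(
    pixel_embeddings: List[List[Vector]],
    class_attrs: Dict[str, List[Vector]],
) -> List[List[str]]:
    """Label each pixel with the class whose aggregated embedding is closest."""
    classifiers = {
        name: aggregate_attributes(attrs) for name, attrs in class_attrs.items()
    }
    return [
        [max(classifiers, key=lambda n: cosine(px, classifiers[n])) for px in row]
        for row in pixel_embeddings
    ]


# Toy attribute embeddings (hypothetical values, two attributes per class).
class_attrs = {
    "cat": [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0]],
    "grass": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.2]],
}

# A 2x2 "image" of toy per-pixel embeddings.
pixels = [
    [[0.8, 0.1, 0.0], [0.1, 0.7, 0.1]],
    [[0.9, 0.0, 0.1], [0.0, 1.0, 0.0]],
]

label_map = segment(pixels, class_attrs)
print(label_map)
```

In this toy setup, a class is never matched by its name alone: only its attribute descriptions contribute to the classifier, which is why the approach can in principle cope with ambiguous, unseen, or hard-to-name categories.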