Abstract: Open-vocabulary part segmentation (OVPS) is an emerging research area focused on segmenting fine-grained entities based on diverse and previously unseen vocabularies. Our study highlights the inherent complexities of part segmentation due to intricate boundaries and diverse granularity, reflecting the knowledge-based nature of part identification. To address these challenges, we propose PartCLIPSeg, a novel framework utilizing generalized parts and object-level contexts to mitigate the lack of generalization in fine-grained parts. PartCLIPSeg integrates competitive part relationships and attention control techniques, alleviating ambiguous boundaries and underrepresented parts. Experimental results demonstrate that PartCLIPSeg outperforms existing state-of-the-art OVPS methods, offering refined segmentation and an advanced understanding of part relationships in images. Through extensive experiments, our model demonstrated an improvement over the state-of-the-art models on the Pascal-Part-116, ADE20K-Part-234, and PartImageNet datasets.

What problem does this paper attempt to address?

This paper addresses three key problems in open-vocabulary part segmentation (OVPS):

1. **Insufficient generalization**: Existing methods often misclassify parts of one object as parts of another when dealing with unseen classes. For example, "dog's leg" may be misclassified as "cat's leg", or "dog's tail" as "sheep's ear".
2. **Blurred boundaries**: The boundaries between different parts are not clearly delineated, so predicted parts overlap. For example, "aircraft wing" may overlap with "aircraft fuselage".
3. **Ignoring small or uncommon parts**: Smaller or less common parts, such as "beak" and "leg", are easily missed, leading to biased predictions.

To address these challenges, the authors propose a new framework, PartCLIPSeg, built on three main components:

1. **Generalized part-level and object-level context**: The model obtains pseudo-labels from vision-language models (VLMs) and is trained under both object-level and part-level supervision. This improves its ability to recognize object boundaries and helps it learn object-level and part-level categories.
2. **Attention control to minimize part overlap**: The overlap between predicted parts is penalized directly, which enforces a clear separation between parts and addresses blurred part boundaries.
3. **Normalized activation to enhance the representation of small parts**: Self-attention information is normalized so that small and uncommon regions are not dropped from the pseudo-labels. This keeps parts at the finest granularity level in the final prediction.

Through these three modules, PartCLIPSeg overcomes the limitations of existing OVPS methods and achieves robust multi-granularity segmentation.
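To make the third component more concrete, here is a minimal sketch of per-part activation normalization. It assumes min-max rescaling of each part's attention map; the summary only says that self-attention information is normalized, so the exact scheme, the function names, and the toy values below are all hypothetical.

```python
def minmax_normalize(attn_map, eps=1e-8):
    """Rescale one part's 2-D attention map to [0, 1] (assumed scheme)."""
    flat = [v for row in attn_map for v in row]
    lo, hi = min(flat), max(flat)
    scale = hi - lo
    if scale < eps:
        # A constant map carries no spatial information; return zeros.
        return [[0.0 for _ in row] for row in attn_map]
    return [[(v - lo) / scale for v in row] for row in attn_map]


def normalize_parts(attn_maps):
    """Normalize every part independently so small parts are not drowned out."""
    return {name: minmax_normalize(m) for name, m in attn_maps.items()}


# Toy 2x3 attention maps (hypothetical values): the beak is small and
# weakly activated compared with the body.
attn = {
    "bird's body": [[0.9, 0.8, 0.1], [0.7, 0.6, 0.1]],
    "bird's beak": [[0.00, 0.02, 0.10], [0.00, 0.01, 0.04]],
}

normalized = normalize_parts(attn)
peaks = {name: max(max(row) for row in m) for name, m in normalized.items()}
print(peaks)
```

Before normalization the beak peaks at 0.10 against 0.9 for the body, so a fixed threshold would discard it. After rescaling each map on its own, both parts reach the same peak and the beak can survive thresholding.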
Experimental results show that PartCLIPSeg significantly outperforms existing methods on multiple benchmark datasets, especially on unseen classes.

### Formula summary

1. **Object and part embedding generation**:

\[ e_T^{[\text{obj}|\text{part}]} = \text{CLIP}_T^*(c^{[\text{obj}|\text{part}]}) \]

\[ e_I = \text{CLIP}_I^*(I) \]

2. **Feature modulation**:

\[ e_I^{[\text{obj}|\text{part}]} = e_I \oplus \text{FiLM}(e_T^{[\text{obj}|\text{part}]}) \]

3. **Loss function**:

\[ L_{\text{mask}} = \frac{1}{|C_{\text{obj-part}}| + 1} \sum_{i=1}^{|C_{\text{obj-part}}| + 1} (1 - \text{BCE}(s_i, \hat{s}_i)) \]

\[ L_{\text{sep}} = \frac{1}{|C|} \left| \left\{ (h, w) \mid \sum_{c \in C} B_M^c(h, w) > 1 \right\} \right| \]

\[ L_{\text{enh}} = 1 - \min_{c \in C} \left( \max_{(h, w) \in M_c} A_M^c[h, w] \right) \]

\[ L_{\text{all}} = L_{\text{mask}} + \lambda_{\text{sep}} L_{\text{sep}} + \lambda_{\text{enh}} L_{\text{enh}} \]

Together, these formulas define how PartCLIPSeg is trained to handle open-vocabulary parts.
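The separation and enhancement terms can be illustrated with a small pure-Python sketch that follows the formulas for L_sep, L_enh, and L_all literally. Several details are assumptions rather than the authors' implementation: the binarized map B_M^c is obtained here by thresholding the attention map A_M^c at 0.5, M_c is read as the set of pixels belonging to part c, and the function names, lambda weights, and toy values are hypothetical. L_mask is passed in as a plain number.

```python
def binarize(attn_map, threshold=0.5):
    """B_M^c: 1 where the attention exceeds the (assumed) threshold, else 0."""
    return [[1 if v > threshold else 0 for v in row] for row in attn_map]


def separation_loss(attn_maps, threshold=0.5):
    """L_sep: count of pixels claimed by more than one part, divided by |C|."""
    binary = [binarize(m, threshold) for m in attn_maps.values()]
    height, width = len(binary[0]), len(binary[0][0])
    overlapping = sum(
        1
        for h in range(height)
        for w in range(width)
        if sum(b[h][w] for b in binary) > 1
    )
    return overlapping / len(binary)


def enhancement_loss(attn_maps, part_masks):
    """L_enh: 1 minus the weakest per-part peak activation inside its mask."""
    peaks = []
    for name, attn in attn_maps.items():
        mask = part_masks[name]
        inside = [
            attn[h][w]
            for h in range(len(attn))
            for w in range(len(attn[0]))
            if mask[h][w]
        ]
        if inside:  # skip parts whose mask is empty
            peaks.append(max(inside))
    if not peaks:
        return 0.0
    return 1.0 - min(peaks)


def total_loss(l_mask, l_sep, l_enh, lambda_sep=0.1, lambda_enh=0.1):
    """L_all = L_mask + lambda_sep * L_sep + lambda_enh * L_enh."""
    return l_mask + lambda_sep * l_sep + lambda_enh * l_enh


# Toy 2x3 attention maps and part masks (hypothetical values).
attn = {
    "wing": [[0.9, 0.7, 0.1], [0.8, 0.6, 0.0]],
    "fuselage": [[0.1, 0.6, 0.9], [0.2, 0.7, 0.3]],
}
masks = {
    "wing": [[1, 1, 0], [1, 1, 0]],
    "fuselage": [[0, 0, 0], [0, 0, 1]],
}

l_sep = separation_loss(attn)          # two pixels are claimed by both parts
l_enh = enhancement_loss(attn, masks)  # driven by the weakly activated fuselage
l_all = total_loss(0.2, l_sep, l_enh)  # 0.2 stands in for L_mask
print(l_sep, l_enh, l_all)
```

In this toy case the middle column exceeds the threshold in both maps, so L_sep is nonzero, and the fuselage's low peak inside its mask dominates L_enh. Note that the hard threshold used here is not differentiable; a trainable version would need a soft relaxation, which this sketch does not attempt.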

Understanding Multi-Granularity for Open-Vocabulary Part Segmentation

OV-PARTS: Towards Open-Vocabulary Part Segmentation

MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Image Segmentation

Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels

UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding

Going Denser with Open-Vocabulary Part Segmentation

Open-vocabulary Panoptic Segmentation with Embedding Modulation

Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation

Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation

Multi-Granularity Video Object Segmentation

SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation

MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic Segmentation

Learning Open-vocabulary Semantic Segmentation Models from Natural Language Supervision

Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation

Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation

A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model

Fine-Grained Open-Vocabulary Object Recognition via User-Guided Segmentation

Global Knowledge Calibration for Fast Open-Vocabulary Segmentation

TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation

CLIP-VIS: Adapting CLIP for Open-Vocabulary Video Instance Segmentation

Search3D: Hierarchical Open-Vocabulary 3D Segmentation