Understanding Multi-Granularity for Open-Vocabulary Part Segmentation

Jiho Choi, Seonho Lee, Seungho Lee, Minhyun Lee, Hyunjung Shim
2024-06-17
Abstract: Open-vocabulary part segmentation (OVPS) is an emerging research area focused on segmenting fine-grained entities based on diverse and previously unseen vocabularies. Our study highlights the inherent complexities of part segmentation due to intricate boundaries and diverse granularity, reflecting the knowledge-based nature of part identification. To address these challenges, we propose PartCLIPSeg, a novel framework utilizing generalized parts and object-level contexts to mitigate the lack of generalization in fine-grained parts. PartCLIPSeg integrates competitive part relationships and attention control techniques, alleviating ambiguous boundaries and underrepresented parts. Experimental results demonstrate that PartCLIPSeg outperforms existing state-of-the-art OVPS methods, offering refined segmentation and an advanced understanding of part relationships in images. Through extensive experiments, our model demonstrated an improvement over the state-of-the-art models on the Pascal-Part-116, ADE20K-Part-234, and PartImageNet datasets.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses several key problems in open-vocabulary part segmentation (OVPS):

1. **Insufficient generalization**: When dealing with unseen classes, existing methods often misclassify parts of one object as parts of another. For example, "dog's leg" may be misclassified as "cat's leg", or "dog's tail" as "sheep's ear".
2. **Blurred boundaries**: The boundaries between different parts are not clear enough, so predictions overlap. For example, "aircraft wing" may overlap with "aircraft fuselage".
3. **Ignoring small or uncommon parts**: Smaller or less common parts, such as "beak" and "leg", are easily ignored, which biases the predictions.

To address these challenges, the authors propose a new framework, PartCLIPSeg, built on three main components:

1. **Generalized part-level and object-level context**: The model obtains pseudo-labels from vision-language models (VLMs) and is trained under both object-level and part-level supervision. This improves its ability to recognize object boundaries and helps it learn object-level and part-level categories.
2. **Attention control to minimize part overlap**: Directly reducing the overlap between predicted parts enforces clear separation between them, which addresses blurred part boundaries.
3. **Normalizing activation to enhance the representation of small parts**: Normalizing self-attention information prevents small and uncommon regions from being ignored in pseudo-labels, so parts at the finest granularity level are retained in the final prediction.

Through these three modules, PartCLIPSeg overcomes limitations of existing OVPS methods and achieves robust multi-granularity segmentation.
Experimental results show that PartCLIPSeg significantly outperforms existing methods on multiple benchmark datasets, particularly on unseen classes.

### Formula summary

1. **Object and part embedding generation**:

\[ e_T^{[\text{obj}|\text{part}]} = \text{CLIP}_T^*(c^{[\text{obj}|\text{part}]}) \]

\[ e_I = \text{CLIP}_I^*(I) \]

2. **Feature modulation**:

\[ e_I^{[\text{obj}|\text{part}]} = e_I \oplus \text{FiLM}(e_T^{[\text{obj}|\text{part}]}) \]

3. **Loss function**:

\[ L_{\text{mask}} = \frac{1}{|C_{\text{obj-part}}| + 1} \sum_{i=1}^{|C_{\text{obj-part}}| + 1} (1 - \text{BCE}(s_i, \hat{s}_i)) \]

\[ L_{\text{sep}} = \frac{1}{|C|} \left| \left\{ (h,w) \mid \sum_{c \in C} B_M^c(h,w) > 1 \right\} \right| \]

\[ L_{\text{enh}} = 1 - \min_{c \in C} \left( \max_{(h,w) \in M_c} A_M^c[h,w] \right) \]

\[ L_{\text{all}} = L_{\text{mask}} + \lambda_{\text{sep}} L_{\text{sep}} + \lambda_{\text{enh}} L_{\text{enh}} \]

Together, these formulas describe how PartCLIPSeg conditions image features on object and part text embeddings, and which loss terms it is trained with to handle open-vocabulary parts.
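To make the separation and enhancement terms concrete, the following is a minimal sketch that evaluates \(L_{\text{sep}}\), \(L_{\text{enh}}\), and \(L_{\text{all}}\) on a toy 2×2 image, written exactly as the formulas above read. It is an illustration only, not the authors' implementation. Several details are assumptions made for the example: the binary masks \(B_M^c\) are obtained by thresholding the attention maps \(A_M^c\) at 0.5, each region \(M_c\) is given as a list of pixel coordinates, the part names and all numeric values are invented, and plain Python lists stand in for tensors.

```python
from typing import Dict, List, Tuple

Grid = List[List[float]]      # one H x W attention map with values in [0, 1]
Region = List[Tuple[int, int]]  # pixel coordinates (h, w) belonging to M_c


def binarize(attn: Grid, threshold: float = 0.5) -> List[List[int]]:
    """Turn an attention map A_M^c into a binary mask B_M^c.

    The 0.5 threshold is an assumption made for this sketch.
    """
    return [[1 if value > threshold else 0 for value in row] for row in attn]


def separation_loss(attn_maps: Dict[str, Grid], threshold: float = 0.5) -> float:
    """L_sep: count of pixels claimed by more than one part, divided by |C|."""
    masks = [binarize(attn, threshold) for attn in attn_maps.values()]
    height, width = len(masks[0]), len(masks[0][0])
    overlapping = sum(
        1
        for h in range(height)
        for w in range(width)
        if sum(mask[h][w] for mask in masks) > 1
    )
    return overlapping / len(masks)


def enhancement_loss(attn_maps: Dict[str, Grid], regions: Dict[str, Region]) -> float:
    """L_enh: 1 minus the weakest per-part peak activation inside its region M_c."""
    peaks = [
        max(attn_maps[part][h][w] for (h, w) in regions[part])
        for part in attn_maps
    ]
    return 1.0 - min(peaks)


def total_loss(l_mask: float, l_sep: float, l_enh: float,
               lambda_sep: float = 1.0, lambda_enh: float = 1.0) -> float:
    """L_all = L_mask + lambda_sep * L_sep + lambda_enh * L_enh."""
    return l_mask + lambda_sep * l_sep + lambda_enh * l_enh


# Toy data (hypothetical values): three parts on a 2 x 2 image.
attn_maps = {
    "head":  [[0.9, 0.6], [0.1, 0.0]],
    "torso": [[0.2, 0.7], [0.8, 0.1]],
    "tail":  [[0.0, 0.0], [0.1, 0.3]],
}
regions = {
    "head":  [(0, 0), (0, 1)],
    "torso": [(0, 1), (1, 0)],
    "tail":  [(1, 1)],
}

l_sep = separation_loss(attn_maps)            # one overlapping pixel / 3 parts
l_enh = enhancement_loss(attn_maps, regions)  # weakest peak is "tail" at 0.3
l_all = total_loss(0.5, l_sep, l_enh)         # 0.5 is a placeholder for L_mask
print(round(l_sep, 4), round(l_enh, 4), round(l_all, 4))
```

In this toy case only pixel (0, 1) is claimed by both "head" and "torso", so the separation term is 1/3, while the weakly activated "tail" drives the enhancement term to 0.7. Counting thresholded pixels is not differentiable, so a trainable version would need a soft relaxation of the overlap count; the sketch is meant only to show what each term measures.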