MaskDiffusion: Exploiting Pre-trained Diffusion Models for Semantic Segmentation

Yasufumi Kawano,Yoshimitsu Aoki
2024-03-17
Abstract:Semantic segmentation is essential in computer vision for various applications, yet traditional approaches face significant challenges, including the high cost of annotation and extensive training for supervised learning. Additionally, due to the limited predefined categories in supervised learning, models typically struggle with infrequent classes and are unable to predict novel classes. To address these limitations, we propose MaskDiffusion, an innovative approach that leverages pretrained frozen Stable Diffusion to achieve open-vocabulary semantic segmentation without the need for additional training or annotation, leading to improved performance compared to similar methods. We also demonstrate the superior performance of MaskDiffusion in handling open vocabularies, including fine-grained and proper noun-based categories, thus expanding the scope of segmentation applications. Overall, our MaskDiffusion shows significant qualitative and quantitative improvements in contrast to other comparable unsupervised segmentation methods, i.e. on the Potsdam dataset (+10.5 mIoU compared to GEM) and COCO-Stuff (+14.8 mIoU compared to DiffSeg). All code and data will be released at
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve This paper aims to address several key challenges in the field of semantic segmentation, specifically including: 1. **High Annotation Cost**: Traditional semantic segmentation methods require a large amount of pixel-level annotated data, which is very time-consuming and expensive in practical applications. 2. **Limitations of Supervised Learning**: Since supervised learning relies on predefined category sets, models often struggle to handle rare or entirely new classes. 3. **Open Vocabulary Semantic Segmentation**: A new method named MaskDiffusion is proposed, which utilizes a pre-trained Stable Diffusion model to achieve open vocabulary semantic segmentation without additional training or annotation. By combining unsupervised and open vocabulary semantic segmentation methods, this paper proposes a new framework that can effectively handle various categories, particularly excelling in dealing with fine-grained categories and proper nouns. Experimental results show that MaskDiffusion significantly outperforms existing methods on multiple benchmark datasets (such as Potsdam and COCO-Stuff). Overall, MaskDiffusion provides high-quality semantic segmentation results without the need for additional training and can handle a wider range of categories, expanding the application scope of semantic segmentation.