MaskDiffusion: Exploiting Pre-trained Diffusion Models for Semantic Segmentation

Yasufumi Kawano, Yoshimitsu Aoki

2024-03-17

Abstract: Semantic segmentation is essential in computer vision for various applications, yet traditional approaches face significant challenges, including the high cost of annotation and extensive training for supervised learning. Additionally, due to the limited predefined categories in supervised learning, models typically struggle with infrequent classes and are unable to predict novel classes. To address these limitations, we propose MaskDiffusion, an innovative approach that leverages pretrained frozen Stable Diffusion to achieve open-vocabulary semantic segmentation without the need for additional training or annotation, leading to improved performance compared to similar methods. We also demonstrate the superior performance of MaskDiffusion in handling open vocabularies, including fine-grained and proper noun-based categories, thus expanding the scope of segmentation applications. Overall, our MaskDiffusion shows significant qualitative and quantitative improvements in contrast to other comparable unsupervised segmentation methods, i.e. on the Potsdam dataset (+10.5 mIoU compared to GEM) and COCO-Stuff (+14.8 mIoU compared to DiffSeg). All code and data will be released at
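
The gains quoted in the abstract are stated in mIoU (mean intersection-over-union). As a reminder of what that metric measures, here is a minimal sketch in Python with NumPy. The function name `mean_iou` and the toy arrays are illustrative assumptions, not taken from the paper, and benchmark evaluations usually accumulate intersections and unions over a whole dataset rather than a single image as done here.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that appear in pred or target.

    pred, target: integer label maps of identical shape.
    Hypothetical helper for illustration; single-image simplification.
    """
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both maps, so it is skipped
        inter = np.logical_and(pred_c, target_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else float("nan")

# Toy 2x3 label maps with three classes (made-up data).
pred = np.array([[0, 0, 1],
                 [1, 1, 2]])
target = np.array([[0, 1, 1],
                   [1, 1, 2]])

score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.2f}")
```

In this toy case the per-class IoUs are 1/2, 3/4 and 1/1, so the mean is 0.75. Papers typically report the value multiplied by 100, which is the scale on which a figure such as "+10.5 mIoU" is read.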

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses several key challenges in semantic segmentation:

1. **High Annotation Cost**: Traditional semantic segmentation methods require large amounts of pixel-level annotated data, which is time-consuming and expensive to produce.
2. **Limitations of Supervised Learning**: Because supervised learning relies on a predefined set of categories, models often struggle with rare classes and cannot predict entirely new ones.
3. **Open Vocabulary Semantic Segmentation**: The paper proposes MaskDiffusion, which uses a pre-trained, frozen Stable Diffusion model to perform open-vocabulary semantic segmentation without additional training or annotation.

By combining unsupervised and open-vocabulary semantic segmentation, the paper proposes a framework that handles a wide variety of categories and is reported to do particularly well on fine-grained categories and proper nouns.

According to the abstract, MaskDiffusion outperforms comparable unsupervised segmentation methods on benchmark datasets: +10.5 mIoU over GEM on Potsdam and +14.8 mIoU over DiffSeg on COCO-Stuff.

Overall, MaskDiffusion provides high-quality semantic segmentation results without additional training and can handle a wider range of categories, expanding the application scope of semantic segmentation.
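
To make the open-vocabulary idea concrete, the sketch below shows only the final labelling step, and only under an assumption: that some upstream component has already produced one relevance map per text prompt (for example, from the attention of a frozen text-to-image model). The summary above does not describe how MaskDiffusion obtains or combines such maps, so the function `assign_labels`, the prompt list, and the scores are all hypothetical and are not the authors' method.

```python
import numpy as np

def assign_labels(score_maps, class_names):
    """Give each pixel the prompt with the highest relevance score.

    score_maps: array of shape (C, H, W), one map per text prompt
                (assumed to be computed elsewhere; not shown here).
    class_names: list of C free-form prompts, which is what makes the
                 vocabulary "open": any string can be added at inference.
    """
    label_ids = np.argmax(score_maps, axis=0)  # shape (H, W)
    label_names = [[class_names[i] for i in row] for row in label_ids]
    return label_ids, label_names

# Made-up prompts, including a fine-grained category.
class_names = ["sky", "golden retriever"]

# Made-up 2x2 relevance maps, one per prompt.
score_maps = np.array([
    [[0.9, 0.2],
     [0.8, 0.1]],   # scores for "sky"
    [[0.1, 0.7],
     [0.3, 0.6]],   # scores for "golden retriever"
])

label_ids, label_names = assign_labels(score_maps, class_names)
print(label_ids)
print(label_names)
```

Because the class list is just a list of strings, adding a new category requires no retraining in this sketch. The quality of the result depends entirely on how the per-prompt maps are produced, which is the part the paper itself contributes.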

DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models

When Masked Image Modeling Meets Source-free Unsupervised Domain Adaptation: Dual-Level Masked Network for Semantic Segmentation

Exploring Limits of Diffusion-Synthetic Training with Weakly Supervised Semantic Segmentation

EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models

Diffusion Features to Bridge Domain Gap for Semantic Segmentation

Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models

DFormer: Diffusion-guided Transformer for Universal Image Segmentation

FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models

DiffusionSeg: Adapting Diffusion Towards Unsupervised Object Discovery

Unleashing the Potential of the Diffusion Model in Few-shot Semantic Segmentation

Denoising Diffusion Semantic Segmentation with Mask Prior Modeling

Open-vocabulary Object Segmentation with Diffusion Models

Label-Efficient Semantic Segmentation with Diffusion Models

Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion

DifFSS: Diffusion Model for Few-Shot Semantic Segmentation

P-MSDiff: Parallel Multi-Scale Diffusion for Remote Sensing Image Segmentation

Dataset Diffusion: Diffusion-based Synthetic Dataset Generation for Pixel-Level Semantic Segmentation

Few-shot Semantic Segmentation Via Perceptual Attention and Spatial Control

Enhancing Label-efficient Medical Image Segmentation with Text-guided Diffusion Models