Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter

Jinglong Wang, Xiawei Li, Jing Zhang, Qingyuan Xu, Qin Zhou, Qian Yu, Lu Sheng, Dong Xu
2023-01-01
Abstract: Pre-trained text-image discriminative models, such as CLIP, have been explored for open-vocabulary semantic segmentation with unsatisfactory results due to the loss of crucial localization information and awareness of object shapes. Recently, there has been a growing interest in expanding the application of generative models from generation tasks to semantic segmentation. These approaches utilize generative models either for generating annotated data or for extracting features to facilitate semantic segmentation. This typically involves generating a considerable amount of synthetic data or requiring additional mask annotations. To this end, we uncover the potential of generative text-to-image diffusion models (e.g., Stable Diffusion) as highly efficient open-vocabulary semantic segmenters, and introduce a novel training-free approach named DiffSegmenter. The insight is that to generate realistic objects that are semantically faithful to the input text, both the complete object shapes and the corresponding semantics are implicitly learned by diffusion models. We discover that the object shapes are characterized by the self-attention maps while the semantics are indicated through the cross-attention maps produced by the denoising U-Net, forming the basis of our segmentation results. Additionally, we carefully design effective textual prompts and a category filtering mechanism to further enhance the segmentation results. Extensive experiments on three benchmark datasets show that the proposed DiffSegmenter achieves impressive results for open-vocabulary semantic segmentation.
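To make the attention-map idea concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes the cross-attention maps have already been reduced to one score per pixel and candidate class, and that the self-attention map is a row-normalized pixel-to-pixel affinity matrix. The function names, the toy values, and the specific combination rule (propagating class scores through self-attention, then taking a per-pixel argmax) are illustrative assumptions.

```python
# Toy sketch (assumption): combine class scores from cross-attention with
# pixel affinities from self-attention. All names and values are hypothetical.

def refine_with_self_attention(self_attn, cross_attn):
    """refined[i][c] = sum_j self_attn[i][j] * cross_attn[j][c]."""
    num_classes = len(cross_attn[0])
    return [
        [sum(w * cross_attn[j][c] for j, w in enumerate(row))
         for c in range(num_classes)]
        for row in self_attn
    ]

def segment(self_attn, cross_attn, class_names):
    """Assign each pixel the class with the highest refined score."""
    refined = refine_with_self_attention(self_attn, cross_attn)
    return [
        class_names[max(range(len(scores)), key=scores.__getitem__)]
        for scores in refined
    ]

CLASS_NAMES = ["cat", "background"]

# Four pixels; pixels 0-1 belong to one object, pixels 2-3 to another region.
# Cross-attention alone would mislabel pixel 1 (0.4 < 0.6).
CROSS_ATTN = [
    [0.9, 0.1],
    [0.4, 0.6],
    [0.1, 0.9],
    [0.2, 0.8],
]

# Self-attention groups pixels of the same region (each row sums to 1).
SELF_ATTN = [
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.5, 0.5],
]

labels = segment(SELF_ATTN, CROSS_ATTN, CLASS_NAMES)
print(labels)
```

In this toy case, pixel 1 has a weak "cat" score on its own, but because self-attention ties it to pixel 0, the propagated scores assign both pixels to the same class. This illustrates how shape information could complete a semantic map.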