DFormer: Diffusion-guided Transformer for Universal Image Segmentation

Hefeng Wang, Jiale Cao, Rao Muhammad Anwer, Jin Xie, Fahad Shahbaz Khan, Yanwei Pang
2023-06-08
Abstract: This paper introduces an approach, named DFormer, for universal image segmentation. The proposed DFormer views the universal image segmentation task as a denoising process using a diffusion model. DFormer first adds various levels of Gaussian noise to ground-truth masks, and then learns a model to predict denoised masks from the corrupted masks. Specifically, we take deep pixel-level features along with the noisy masks as inputs to generate mask features and attention masks, employing a diffusion-based decoder to perform mask prediction gradually. At inference, our DFormer directly predicts the masks and corresponding categories from a set of randomly generated masks. Extensive experiments reveal the merits of our proposed contributions on different image segmentation tasks: panoptic segmentation, instance segmentation, and semantic segmentation. Our DFormer outperforms the recent diffusion-based panoptic segmentation method Pix2Seq-D with a gain of 3.6% on the MS COCO val2017 set. Further, DFormer achieves promising semantic segmentation performance, outperforming the recent diffusion-based method by 2.2% on the ADE20K val set. Our source code and models will be publicly available at https://github.com/cp3wan/DFormer
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper aims to design an effective diffusion-based Transformer that achieves competitive universal image segmentation performance. Existing segmentation methods are usually optimized for a specific task and generalize poorly to other segmentation tasks. To address this, the paper proposes DFormer, a diffusion-guided Transformer framework for universal image segmentation that treats segmentation as generating masks from noise. During training, noisy masks are produced by adding different levels of Gaussian noise to the ground-truth masks, and a Transformer decoder learns to predict the ground-truth masks from these noisy masks. At inference, DFormer directly predicts masks and their corresponding categories from a set of randomly generated noisy masks. In this way, DFormer aims to overcome the limited generalization of existing methods across segmentation tasks and to deliver consistent improvements on panoptic, instance, and semantic segmentation. Experiments show that DFormer outperforms the recent diffusion-based panoptic segmentation method Pix2Seq-D by 3.6% on the MS COCO validation set, and outperforms the recent diffusion-based method by 2.2% for semantic segmentation on the ADE20K validation set. Beyond accuracy, DFormer is also described as offering advantages in parameter efficiency and training efficiency.
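The noising step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine noise schedule, the mapping of binary masks to a zero-centred range, and the names `alpha_bar`, `corrupt_mask`, `random_masks`, `num_steps`, and `scale` are all assumptions chosen for illustration, following the standard forward-diffusion form x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * noise. The decoder that maps noisy masks back to clean masks is not shown.

```python
import math
import random


def alpha_bar(t, num_steps, s=0.008):
    """Fraction of signal kept at step t (cosine schedule, an assumed choice)."""
    def f(u):
        return math.cos((u / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def corrupt_mask(mask, t, num_steps=1000, scale=1.0, rng=None):
    """Training-time corruption: blend a ground-truth mask with Gaussian noise.

    mask: 2D list of 0/1 values.
    t: noise level in [0, num_steps]; larger t means a noisier mask.
    """
    if rng is None:
        rng = random.Random()
    a = alpha_bar(t, num_steps)
    signal, noise = math.sqrt(a), math.sqrt(1.0 - a)
    noisy = []
    for row in mask:
        # Map {0, 1} to {-scale, +scale} so the clean signal is zero-centred
        # (a hypothetical preprocessing choice), then add scaled noise.
        noisy.append([
            signal * (2.0 * v - 1.0) * scale + noise * rng.gauss(0.0, 1.0)
            for v in row
        ])
    return noisy


def random_masks(num_masks, height, width, rng=None):
    """Inference-time starting point: masks drawn from pure Gaussian noise."""
    if rng is None:
        rng = random.Random()
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]
        for _ in range(num_masks)
    ]
```

In this sketch, training would call `corrupt_mask` with a randomly chosen `t` for each ground-truth mask and ask the decoder to recover the clean mask, while inference would start from `random_masks` and let the decoder predict masks and categories from them.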