Panoptic Diffusion Models: co-generation of images and segmentation maps

Yinghan Long, Kaushik Roy

2024-12-04

Abstract: Recently, diffusion models have demonstrated impressive capabilities in text-guided and image-conditioned image generation. However, existing diffusion models cannot simultaneously generate a segmentation map of objects and a corresponding image from the prompt. Previous attempts either generate segmentation maps based on the images or provide maps as input conditions to control image generation, limiting their functionality to given inputs. Incorporating an inherent understanding of the scene layouts can improve the creativity and realism of diffusion models. To address this limitation, we present Panoptic Diffusion Model (PDM), the first model designed to generate both images and panoptic segmentation maps concurrently. PDM bridges the gap between image and text by constructing segmentation layouts that provide detailed, built-in guidance throughout the generation process. This ensures the inclusion of categories mentioned in text prompts and enriches the diversity of segments within the background. We demonstrate the effectiveness of PDM across two architectures: a unified diffusion transformer and a two-stream transformer with a pretrained backbone. To facilitate co-generation with fewer sampling steps, we incorporate a fast diffusion solver into PDM. Additionally, when ground-truth maps are available, PDM can function as a text-guided image-to-image generation model. Finally, we propose a novel metric for evaluating the quality of generated maps and show that PDM achieves state-of-the-art results in image generation with implicit scene control.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses the inability of existing diffusion models to generate an image and its segmentation map at the same time, which limits both the functionality of the models and the diversity and realism of the generated images. Existing methods either derive segmentation maps from given images or take given segmentation maps as conditions to control image generation; none produces both from a text prompt. The paper proposes the Panoptic Diffusion Model (PDM), which it presents as the first model to generate images and panoptic segmentation maps concurrently. PDM constructs segmentation layouts that provide detailed, built-in guidance throughout generation, ensuring that the categories mentioned in the text prompt are included and enriching the diversity of segments in the background. PDM also incorporates a fast diffusion solver to reduce the number of sampling steps, and the paper proposes a new metric for evaluating the quality of the generated segmentation maps.
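To make the idea of co-generation concrete, here is a minimal sketch of denoising an image and a segmentation map in a single sampling loop. It is not the authors' implementation: the abstract does not specify how maps are encoded or which fast solver is used, so this sketch assumes the map is stored as one channel per class stacked under the image channels, and it substitutes a deterministic DDIM-style update for the solver. The names `ddim_step`, `co_generate`, `make_oracle`, and the linear `abar` schedule are all hypothetical, and `predict_noise` stands in for a trained network.

```python
import numpy as np

def ddim_step(x_t, eps, abar_t, abar_prev):
    """One deterministic DDIM-style update (a stand-in for a fast solver)."""
    # Estimate the clean sample from the current noisy sample and predicted noise.
    x0 = (x_t - np.sqrt(1.0 - abar_t) * eps) / np.sqrt(abar_t)
    # Re-noise the estimate to the previous (less noisy) level.
    return np.sqrt(abar_prev) * x0 + np.sqrt(1.0 - abar_prev) * eps

def co_generate(predict_noise, img_shape, n_classes, steps=10, seed=0):
    """Denoise image channels and per-class map channels jointly.

    Assumption: the segmentation map is represented as n_classes extra
    channels, so one noise predictor sees both modalities at every step.
    """
    rng = np.random.default_rng(seed)
    c_img, h, w = img_shape
    x = rng.standard_normal((c_img + n_classes, h, w))
    # Hypothetical schedule: abar[0] = 1 (clean), abar[steps] close to 0 (noise).
    abar = np.linspace(1.0, 0.001, steps + 1)
    for i in range(steps, 0, -1):
        eps = predict_noise(x, abar[i])
        x = ddim_step(x, eps, abar[i], abar[i - 1])
    image = x[:c_img]
    seg_map = x[c_img:].argmax(axis=0)  # per-pixel class label
    return image, seg_map

def make_oracle(target):
    """Toy noise predictor that already knows the clean target (testing only)."""
    return lambda x_t, abar_t: (x_t - np.sqrt(abar_t) * target) / np.sqrt(1.0 - abar_t)
```

With a trained model in place of the oracle, the same loop would return an image and a label map that were denoised together, which is the property the summary above describes. A true panoptic map also distinguishes object instances, which this per-class encoding leaves out.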
