Abstract: The need for large amounts of training and validation data is a major concern in scaling AI algorithms for autonomous driving. Semantic Image Synthesis (SIS), or label-to-image translation, promises to address this issue by translating semantic layouts to images, providing controllable generation of photorealistic data. However, SIS models require a large amount of paired data, which incurs extra costs. In this work, we present a new task: given a dataset of synthetic images with labels and a dataset of unlabeled real images, the goal is to learn a model that generates images with the content of the input mask and the appearance of real images. This new task reframes the well-known unsupervised SIS task in a more practical setting, where we leverage cheaply available synthetic data from a driving simulator to learn how to generate photorealistic images of urban scenes. This stands in contrast to previous works, which assume that labels and images come from the same domain but are unpaired during training. We find that previous unsupervised works underperform on this task, as they do not handle the distribution shift between the two domains. To address this problem, we propose a novel framework with two main contributions. First, we use the synthetic image as a guide to the content of the generated image by penalizing the difference between their high-level features at the patch level. Second, in contrast to previous works, which employ a single discriminator that overfits the semantic distribution of the target domain, we employ one discriminator for the whole image and multiscale discriminators on image patches. Extensive comparisons on the benchmarks GTA-V $\rightarrow$ Cityscapes and GTA-V $\rightarrow$ Mapillary show that the proposed model outperforms the state of the art on this task.
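
As a rough illustration of the first contribution, and not the paper's actual formulation, a patch-level content term could take a form like the one below. The feature extractor $\phi$, the patch set $\mathcal{P}$, the generator symbol $G$, and the choice of an L1 distance are all assumptions introduced for this sketch; the abstract does not specify any of them.

$$\mathcal{L}_{\text{content}} = \frac{1}{|\mathcal{P}|} \sum_{p \in \mathcal{P}} \left\| \phi_p\big(G(m)\big) - \phi_p(x_s) \right\|_1$$

Here $m$ is the input semantic mask, $x_s$ is the synthetic image paired with $m$, and $\phi_p(\cdot)$ denotes the high-level features restricted to patch $p$. Under these assumptions, the term is small when the generated image and the synthetic image agree in content patch by patch, while appearance is left to be shaped by the discriminators.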

USIS: A unified semantic image synthesis model trained on a single or multiple samples

USIS: Unsupervised Semantic Image Synthesis

Fine-grained Semantic Constraint in Image Synthesis

Semantic Image Synthesis with Unconditional Generator

Wavelet-based Unsupervised Label-to-Image Translation

Semantically Multi-Modal Image Synthesis

UNet-like network fused swin transformer and CNN for semantic image synthesis

Semantic Probability Distribution Modeling for Diverse Semantic Image Synthesis

SCP-Diff: Photo-Realistic Semantic Image Synthesis with Spatial-Categorical Joint Prior

Diverse Semantic Image Synthesis via Probability Distribution Modeling

Semantic RGB-D Image Synthesis

Towards Pragmatic Semantic Image Synthesis for Urban Scenes

IIDM: Image-to-Image Diffusion Model for Semantic Image Synthesis

Unpaired semantic neural person image synthesis

Semantic Image Synthesis via Class-Adaptive Cross-Attention

Semantic Image Synthesis Via Diffusion Models

Referenceless User Controllable Semantic Image Synthesis

Inferring and Leveraging Parts from Object Shape for Improving Semantic Image Synthesis

Synthetic Instance Segmentation from Semantic Image Segmentation Masks

Stochastic Conditional Diffusion Models for Robust Semantic Image Synthesis

Retrieval-based Spatially Adaptive Normalization for Semantic Image Synthesis