Abstract: While originally designed for image generation, diffusion models have recently been shown to provide excellent pretrained feature representations for semantic segmentation. Intrigued by this result, we set out to explore how well diffusion-pretrained representations generalize to new domains, a crucial ability for any representation. We find that diffusion-pretraining achieves extraordinary domain generalization results for semantic segmentation, outperforming both supervised and self-supervised backbone networks. Motivated by this, we investigate how to utilize the model's unique ability of taking an input prompt, in order to further enhance its cross-domain performance. We introduce a scene prompt and a prompt randomization strategy to help further disentangle the domain-invariant information when training the segmentation head. Moreover, we propose a simple but highly effective approach for test-time domain adaptation, based on learning a scene prompt on the target domain in an unsupervised manner. Extensive experiments conducted on four synthetic-to-real and clear-to-adverse weather benchmarks demonstrate the effectiveness of our approaches. Without resorting to any complex techniques, such as image translation, augmentation, or rare-class sampling, we set a new state-of-the-art on all benchmarks. Our implementation will be publicly available at https://github.com/ETHRuiGong/PTDiffSeg.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper addresses two key problems in **cross-domain semantic segmentation**: **domain generalization (DG)** and **test-time domain adaptation (TTDA)**. Specifically:

1. **Domain generalization (DG)**:
   - **Problem description**: Deep neural networks perform well when the training and test data follow the same distribution, but their performance drops significantly under domain shift, that is, when the test (target) data distribution differs from the training (source) data distribution.
   - **Research objective**: Explore how diffusion-pretrained models perform in cross-domain semantic segmentation, especially how well they generalize to unseen domains. The authors find that diffusion-pretrained models generalize well across domains, outperforming supervised and self-supervised backbone networks.
2. **Test-time domain adaptation (TTDA)**:
   - **Problem description**: How to use unlabeled target-domain data at test time to adapt an already trained model and improve its performance in the new domain.
   - **Research objective**: Propose a simple and effective method that adapts to new domains by fine-tuning the scene prompt at test time, without relying on complex image translation, augmentation, or rare-class sampling techniques.

### Main contributions of the paper

1. **First analysis of the generalization performance of diffusion-pretrained models in semantic segmentation**, showing their superior performance.
2. **Prompt-based methods**, namely a scene prompt and prompt randomization, that further improve the model's domain generalization ability.
3. **A prompt fine-tuning method** for test-time domain adaptation, enabling the model to adapt quickly to new domains.
4. **Extensive experimental verification**, demonstrating the effectiveness of the proposed methods on four benchmarks. In particular, on the Cityscapes → ACDC task, the DG and TTDA methods achieve 61.2% and 62.0% mIoU respectively, surpassing the existing state-of-the-art methods.

### Formula representation

- **Forward diffusion process**:
  \[ z_p = \sqrt{\bar{\alpha}_p}\, z_0 + \sqrt{1-\bar{\alpha}_p}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I), \quad \bar{\alpha}_p = \prod_{q=0}^{p} \alpha_q \]
- **Loss function**:
  \[ \mathbb{E}_{p \sim U[1, P]} \left\| \epsilon - \epsilon_\theta(z_p, p; C) \right\|^2 \]
- **Consistency loss**:
  \[ L_c = \sum_{p,q \in \{1,\dots,K\},\, q \neq p} KL(\hat{y}_p \,\|\, \hat{y}_q) = \sum_{p,q \in \{1,\dots,K\},\, q \neq p} \hat{y}_p \log \frac{\hat{y}_p}{\hat{y}_q} \]
- **Total learning objective**:
  \[ L_{total} = \sum_{k=1}^{K} CE(\hat{y}_s^k, y_s) + \lambda L_c \]
- **Test-time optimization objective**:
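
To make the forward diffusion process, the consistency loss, and the total learning objective listed above more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names (`forward_diffuse`, `consistency_loss`, `total_loss`) are hypothetical, predictions are treated as K class-probability vectors for a single pixel, and the label is an integer class index. The noise predictor \(\epsilon_\theta\) and the test-time objective are not sketched, because their details are not given in this summary.

```python
import math
import random


def forward_diffuse(z0, alphas, p, rng):
    """Sample z_p = sqrt(abar_p) * z0 + sqrt(1 - abar_p) * eps with eps ~ N(0, I).

    abar_p is the product alpha_0 * ... * alpha_p, matching the formula above.
    """
    abar = math.prod(alphas[: p + 1])
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zp = [math.sqrt(abar) * z + math.sqrt(1.0 - abar) * e for z, e in zip(z0, eps)]
    return zp, eps


def kl(p_dist, q_dist, tiny=1e-12):
    """KL(p || q) = sum_c p_c * log(p_c / q_c); tiny guards against log(0)."""
    return sum(pc * math.log((pc + tiny) / (qc + tiny)) for pc, qc in zip(p_dist, q_dist))


def consistency_loss(preds):
    """L_c: sum of KL(y_p || y_q) over all ordered pairs with q != p."""
    k = len(preds)
    return sum(kl(preds[i], preds[j]) for i in range(k) for j in range(k) if i != j)


def cross_entropy(pred, label, tiny=1e-12):
    """Cross-entropy of one probability vector against an integer class label."""
    return -math.log(pred[label] + tiny)


def total_loss(preds, label, lam):
    """L_total = sum_k CE(y_k, label) + lam * L_c (assumed per-pixel form)."""
    return sum(cross_entropy(pr, label) for pr in preds) + lam * consistency_loss(preds)


if __name__ == "__main__":
    rng = random.Random(0)
    zp, eps = forward_diffuse([0.5, -1.0, 2.0], [0.99] * 10, p=9, rng=rng)
    preds = [[0.9, 0.1], [0.6, 0.4], [0.8, 0.2]]  # K = 3 hypothetical predictions
    print(zp)
    print(consistency_loss(preds), total_loss(preds, label=0, lam=0.1))
```

In this sketch the consistency term is zero exactly when all K predictions agree, so minimizing it pushes the predictions toward one another, while the cross-entropy term keeps each of them close to the source label.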

Prompting Diffusion Representations for Cross-Domain Semantic Segmentation

Diffusion Features to Bridge Domain Gap for Semantic Segmentation

DiffusionSeg: Adapting Diffusion Towards Unsupervised Object Discovery

Prompting to Adapt Foundational Segmentation Models

Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter

Generalization by Adaptation: Diffusion-Based Domain Extension for Domain-Generalized Semantic Segmentation

MaskDiffusion: Exploiting Pre-Trained Diffusion Models for Semantic Segmentation

Exploring Limits of Diffusion-Synthetic Training with Weakly Supervised Semantic Segmentation

InvSeg: Test-Time Prompt Inversion for Semantic Segmentation

Unleashing the Potential of the Diffusion Model in Few-shot Semantic Segmentation

Trans-Diff: Heterogeneous Domain Adaptation for Remote Sensing Segmentation With Transfer Diffusion

Dataset Diffusion: Diffusion-based Synthetic Dataset Generation for Pixel-Level Semantic Segmentation

Harnessing Diffusion Models for Visual Perception with Meta Prompts

Diffusion-based Image Translation with Label Guidance for Domain Adaptive Semantic Segmentation

RS-Dseg: semantic segmentation of high-resolution remote sensing images based on a diffusion model component with unsupervised pretraining

FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models

DiGA: Distil to Generalize and then Adapt for Domain Adaptive Semantic Segmentation

Unleashing Text-to-Image Diffusion Models for Visual Perception

Label-Efficient Semantic Segmentation with Diffusion Models