Prompting Diffusion Representations for Cross-Domain Semantic Segmentation

Rui Gong, Martin Danelljan, Han Sun, Julio Delgado Mangas, Luc Van Gool
2023-07-05
Abstract: While originally designed for image generation, diffusion models have recently been shown to provide excellent pretrained feature representations for semantic segmentation. Intrigued by this result, we set out to explore how well diffusion-pretrained representations generalize to new domains, a crucial ability for any representation. We find that diffusion-pretraining achieves extraordinary domain generalization results for semantic segmentation, outperforming both supervised and self-supervised backbone networks. Motivated by this, we investigate how to utilize the model's unique ability of taking an input prompt, in order to further enhance its cross-domain performance. We introduce a scene prompt and a prompt randomization strategy to help further disentangle the domain-invariant information when training the segmentation head. Moreover, we propose a simple but highly effective approach for test-time domain adaptation, based on learning a scene prompt on the target domain in an unsupervised manner. Extensive experiments conducted on four synthetic-to-real and clear-to-adverse weather benchmarks demonstrate the effectiveness of our approaches. Without resorting to any complex techniques, such as image translation, augmentation, or rare-class sampling, we set a new state-of-the-art on all benchmarks. Our implementation will be publicly available at https://github.com/ETHRuiGong/PTDiffSeg.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses two key problems in **cross-domain semantic segmentation**: **domain generalization (DG)** and **test-time domain adaptation (TTDA)**. Specifically:

1. **Domain generalization (DG)**:
   - **Problem description**: Deep neural networks perform well when the training and test data follow the same distribution, but their performance drops significantly on data from new domains. The models are sensitive to domain shift, that is, to the test (target) data being distributed differently from the training (source) data.
   - **Research objective**: Explore how diffusion-pretrained models perform in cross-domain semantic segmentation, especially how well they generalize to unseen domains. The authors find that diffusion-pretrained models generalize well across domains, outperforming supervised and self-supervised backbone networks.
2. **Test-time domain adaptation (TTDA)**:
   - **Problem description**: How to use unlabeled target-domain data at test time to adjust an already-trained model and improve its performance on the new domain.
   - **Research objective**: Propose a simple and effective method that adapts to new domains by fine-tuning the scene prompt at test time, without relying on complex image translation, augmentation, or rare-class sampling techniques.

### Main contributions of the paper

1. **For the first time, analyze the generalization performance of diffusion-pretrained models in semantic segmentation**, showing their superior performance.
2. **Introduce prompt-based methods**, including a scene prompt and prompt randomization, to further improve the domain generalization ability of the model.
3. **Propose a prompt fine-tuning method** for test-time domain adaptation, enabling the model to quickly adapt to new domains.
4. **Extensive experimental verification**, demonstrating the effectiveness of the proposed methods on four benchmarks. In particular, on the Cityscapes → ACDC task, the DG and TTDA methods achieve 61.2% and 62.0% mIoU respectively, surpassing existing state-of-the-art methods.

### Formula representation

- **Forward diffusion process**:
  \[ z_p = \sqrt{\bar{\alpha}_p}\, z_0 + \sqrt{1 - \bar{\alpha}_p}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I), \quad \bar{\alpha}_p = \prod_{q=0}^{p} \alpha_q \]
- **Loss function**:
  \[ \mathbb{E}_{p \sim U[1, P]} \left\| \epsilon - \epsilon_\theta(z_p, p; C) \right\|^2 \]
- **Consistency loss**:
  \[ L_c = \sum_{p, q \in \{1, \dots, K\},\, q \neq p} \mathrm{KL}(\hat{y}_p \,\|\, \hat{y}_q) = \sum_{p, q \in \{1, \dots, K\},\, q \neq p} \hat{y}_p \log \frac{\hat{y}_p}{\hat{y}_q} \]
- **Total learning objective**:
  \[ L_{\mathrm{total}} = \sum_{k=1}^{K} \mathrm{CE}(\hat{y}_s^k, y_s) + \lambda L_c \]
- **Test-time optimization objective**: