PiPa++: Towards Unification of Domain Adaptive Semantic Segmentation via Self-supervised Learning

Mu Chen, Zhedong Zheng, Yi Yang
2024-07-24
Abstract: Unsupervised domain adaptive segmentation aims to improve the segmentation accuracy of models on target domains without relying on labeled data from those domains. This approach is crucial when labeled target domain data is scarce or unavailable. It seeks to align the feature representations of the source domain (where labeled data is available) and the target domain (where only unlabeled data is present), thus enabling the model to generalize well to the target domain. Image- and video-level domain adaptation are currently addressed using different, specialized frameworks, training strategies and optimizations despite their underlying connections. In this paper, we propose a unified framework, PiPa++, which leverages the core idea of "comparing" to (1) explicitly encourage learning of discriminative pixel-wise features with intra-class compactness and inter-class separability, (2) promote robust feature learning of the identical patch against different contexts or fluctuations, and (3) enable the learning of temporal continuity under dynamic environments. With the designed task-smart contrastive sampling strategy, PiPa++ enables the mining of more informative training samples according to the task demand. Extensive experiments demonstrate the effectiveness of our method on both image-level and video-level domain adaptation benchmarks. Moreover, the proposed method is compatible with other UDA approaches to further improve the performance without introducing extra parameters.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper aims to improve a model's segmentation accuracy on the target domain without requiring labeled target-domain data. Specifically, it focuses on semantic segmentation under Unsupervised Domain Adaptation (UDA) and uses self-supervised learning to build a unified framework for image-level and video-level domain adaptation. The framework reduces the difference in feature representation between the source domain and the target domain, which improves the model's ability to generalize to the target domain.

### Main Problems and Solutions

1. **Problems**:
   - **High cost of data annotation**: In real-world applications, obtaining large datasets with pixel-level annotations is expensive and time-consuming.
   - **Domain gap**: There is a significant domain gap between synthetic data and real data, which degrades the model's performance on the target domain.
   - **Limitations of existing methods**: Existing image-level and video-level UDA methods usually design task-specific training paradigms and optimization strategies, and therefore lack generality and flexibility.
2. **Solutions**:
   - **Propose the PiPa++ framework**: The framework provides a unified architecture for image-level and video-level UDA tasks through self-supervised learning.
   - **Multi-granularity contrastive learning**: Pixel-level and patch-level contrastive learning strengthen the model's understanding of local context and its robustness.
   - **Task-intelligent sampling strategy**: A sample mining strategy tailored to the requirements of each task captures more informative training samples.
   - **Temporal continuity**: In dynamic scenes, cross-frame temporal contrastive learning maintains temporal continuity.

### Specific Methods

1. **Basic Segmentation Loss**:
   - **Source domain segmentation loss \( L_S^{ce} \)**:
     \[ L_S^{ce} = \mathbb{E} \left[ -p_S^u \log h_{cls}(g_\theta(x_S^u)) \right] \]
     where \( p_S^u \) is the one-hot vector of the label \( y_S^u \), \( g_\theta \) is the visual backbone network, and \( h_{cls} \) is the classification head.
   - **Target domain segmentation loss \( L_T^{ce} \)**:
     \[ L_T^{ce} = \mathbb{E} \left[ -\bar{p}_T^v \log h_{cls}(g_\theta(x_T^v)) \right] \]
     where \( \bar{p}_T^v \) is the one-hot vector of the pseudo-label \( \bar{y}_T^v \), which is generated by the teacher network \( g_{\bar{\theta}} \).
2. **Multi-granularity contrastive learning**:
   - **Pixel-level contrast loss \( L_{Pixel} \)**:
     \[ L_{Pixel} = -\sum_{C(i) = C(j)} \log \frac{r(e_i, e_j)}{\sum_{k = 1}^{N_{pixel}} r(e_i, e_k)} \]
     where \( e \) is the feature map extracted through the projection head \( h_{pixel} \), \( r(e_i, e_j) = \exp \left( \frac{s(e_i, e_j)}{\tau} \right) \), \( s(e_i, e_j) \) is the cosine similarity of two pixel features, and \( \tau \) is the temperature parameter.
   - **Patch-level contrast loss \( L_{Patch} \)**:
     \[ L_{Patch} = -\sum_{O_1(i) = O_2(j)} \log \frac{r(f_i, f_j)}{\sum_{k =