PiPa: Pixel- and Patch-wise Self-supervised Learning for Domain Adaptative Semantic Segmentation

Mu Chen,Zhedong Zheng,Yi Yang,Tat-Seng Chua
DOI: https://doi.org/10.48550/arXiv.2211.07609
2022-11-15
Abstract:Unsupervised Domain Adaptation (UDA) aims to enhance the generalization of the learned model to other domains. The domain-invariant knowledge is transferred from the model trained on labeled source domain, e.g., video game, to unlabeled target domains, e.g., real-world scenarios, saving annotation expenses. Existing UDA methods for semantic segmentation usually focus on minimizing the inter-domain discrepancy of various levels, e.g., pixels, features, and predictions, for extracting domain-invariant knowledge. However, the primary intra-domain knowledge, such as context correlation inside an image, remains underexplored. In an attempt to fill this gap, we propose a unified pixel- and patch-wise self-supervised learning framework, called PiPa, for domain adaptive semantic segmentation that facilitates intra-image pixel-wise correlations and patch-wise semantic consistency against different contexts. The proposed framework exploits the inherent structures of intra-domain images, which: (1) explicitly encourages learning the discriminative pixel-wise features with intra-class compactness and inter-class separability, and (2) motivates the robust feature learning of the identical patch against different contexts or fluctuations. Extensive experiments verify the effectiveness of the proposed method, which obtains competitive accuracy on the two widely-used UDA benchmarks, i.e., 75.6 mIoU on GTA to Cityscapes and 68.2 mIoU on Synthia to Cityscapes. Moreover, our method is compatible with other UDA approaches to further improve the performance without introducing extra parameters.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve This paper aims to address a key issue in Unsupervised Domain Adaptation (UDA): how to improve the generalization ability of models across different domains without incurring additional annotation costs. Specifically, the paper focuses on the task of semantic segmentation from the source domain (e.g., video game images) to the target domain (e.g., real-world scenes), and how to extract and utilize domain-invariant knowledge. ### Main Contributions 1. **Proposed a unified pixel-level and patch-level self-supervised learning framework (PiPa)**: - **Pixel-wise Contrast**: Enhances intra-class compactness and inter-class separability by pulling closer the pixel features of the same class and pushing away the pixel features of different classes. - **Patch-wise Contrast**: Enhances the model's understanding of local context by maintaining the prediction consistency of local patches in different contexts. 2. **No additional annotations required**: The PiPa method can improve the model's performance without introducing additional annotations. 3. **Compatible with existing UDA methods**: PiPa can be combined with other existing UDA methods to further enhance performance without adding extra parameters. ### Method Overview - **Basic Segmentation Loss**: Learns basic knowledge by applying cross-entropy loss on the source domain. - **Pseudo-Label Generation**: Generates pseudo-labels on the target domain and uses these pseudo-labels for training. - **Mixed Data Generation**: Generates mixed labels and mixed inputs by mixing data from the source and target domains to facilitate stable training. - **Pixel-wise Contrast**: Enhances the features of positive sample pairs and pushes away the features of negative sample pairs by calculating pixel similarity. - **Patch-wise Contrast**: Maintains the prediction consistency of partially overlapping patches by calculating the feature consistency of local patches in different contexts. ### Experimental Results - **GTA → Cityscapes**: On the GTA → Cityscapes benchmark, PiPa achieved a performance of 75.6 mIoU, significantly outperforming other methods. - **SYNTHIA → Cityscapes**: On the SYNTHIA → Cityscapes benchmark, PiPa achieved a performance of 68.2 mIoU, also showing excellent results. ### Conclusion By introducing pixel-level and patch-level self-supervised learning, the PiPa method can effectively exploit inherent contextual information within the domain, resulting in significant performance improvements in unsupervised domain adaptation tasks. This method is not only applicable to existing UDA frameworks but also has high practical value in real-world applications.