Abstract: Unsupervised Domain Adaptive Semantic Segmentation (UDA-SS) aims to transfer supervision from a labeled source domain to an unlabeled target domain. Most existing UDA-SS works consider images, whilst recent attempts extend to videos by modeling the temporal dimension. Although the two lines of research share the same major challenge, namely overcoming the underlying domain distribution shift, their studies are largely independent, resulting in fragmented insights, a lack of holistic understanding, and missed opportunities for cross-pollination of ideas. This fragmentation prevents the unification of methods, leading to redundant efforts and suboptimal knowledge transfer across image and video domains. Based on this observation, we advocate unifying the study of UDA-SS across video and image scenarios, enabling a more comprehensive understanding, synergistic advancements, and efficient knowledge sharing. To that end, we explore unified UDA-SS from a general data augmentation perspective, which serves as a unifying conceptual framework and enables improved generalization and cross-pollination of ideas, ultimately contributing to the overall progress and practical impact of this field of research. Specifically, we propose a Quad-directional Mixup (QuadMix) method, characterized by tackling distinct point attributes and feature inconsistencies through four-directional paths for intra- and inter-domain mixing in a feature space. To deal with temporal shifts in videos, we incorporate optical flow-guided feature aggregation across spatial and temporal dimensions for fine-grained domain alignment. Extensive experiments show that our method outperforms state-of-the-art works by large margins on four challenging UDA-SS benchmarks. Our source code and models will be released at \url{https://github.com/ZHE-SAPI/UDASS}.
