Abstract: Unsupervised Domain Adaptive Semantic Segmentation (UDA-SS) aims to transfer supervision from a labeled source domain to an unlabeled target domain. Most existing UDA-SS works consider images, while recent attempts have extended the task to videos by modeling the temporal dimension. Although the two lines of research share the same major challenge, namely overcoming the underlying domain distribution shift, they have been studied largely independently. This results in fragmented insights, a lack of holistic understanding, and missed opportunities for cross-pollination of ideas. The fragmentation also prevents the unification of methods, leading to redundant efforts and suboptimal knowledge transfer across the image and video domains. Based on this observation, we advocate unifying the study of UDA-SS across video and image scenarios, enabling a more comprehensive understanding, synergistic advancements, and efficient knowledge sharing. To that end, we explore unified UDA-SS from a general data augmentation perspective, which serves as a unifying conceptual framework, enables improved generalization, and opens the potential for cross-pollination of ideas, ultimately contributing to the overall progress and practical impact of this field of research. Specifically, we propose a Quad-directional Mixup (QuadMix) method, which tackles distinct point attributes and feature inconsistencies through four-directional paths for intra- and inter-domain mixing in a feature space. To deal with temporal shifts in videos, we incorporate optical flow-guided feature aggregation across spatial and temporal dimensions for fine-grained domain alignment. Extensive experiments show that our method outperforms state-of-the-art works by large margins on four challenging UDA-SS benchmarks. Our source code and models will be released at https://github.com/ZHE-SAPI/UDASS.

CLIP2UDA: Making Frozen CLIP Reward Unsupervised Domain Adaptation in 3D Semantic Segmentation

A New Bidirectional Unsupervised Domain Adaptation Segmentation Framework

CLIP the Divergence: Language-guided Unsupervised Domain Adaptation

CLUDA: Contrastive Learning in Unsupervised Domain Adaptation for Semantic Segmentation

Cross-modal Unsupervised Domain Adaptation for 3D Semantic Segmentation via Bidirectional Fusion-then-Distillation

xMUDA: Cross-Modal Unsupervised Domain Adaptation for 3D Semantic Segmentation

Adversarial unsupervised domain adaptation for 3D semantic segmentation with multi-modal learning

Adversarial Unsupervised Domain Adaptation for 3D Semantic Segmentation with 2D Image Fusion of Dense Depth

Cross-Modal Contrastive Learning for Domain Adaptation in 3D Semantic Segmentation

UMFC: Unsupervised Multi-Domain Feature Calibration for Vision-Language Models

CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP

Self-supervised Exclusive Learning for 3D Segmentation with Cross-modal Unsupervised Domain Adaptation

An Unsupervised Domain Adaption Framework for Aerial Image Semantic Segmentation Based on Curriculum Learning

Unified Domain Adaptive Semantic Segmentation

A Study on Unsupervised Domain Adaptation for Semantic Segmentation in the Era of Vision-Language Models

UniDA3D: Unified Domain Adaptive 3D Semantic Segmentation Pipeline

PiPa: Pixel- and Patch-wise Self-supervised Learning for Domain Adaptative Semantic Segmentation

Multi-Modal Unsupervised Domain Adaptation for Semantic Image Segmentation

Adaptive Prompt Learning with Negative Textual Semantics and Uncertainty Modeling for Universal Multi-Source Domain Adaptation

Transferring CLIP's Knowledge into Zero-Shot Point Cloud Semantic Segmentation