Abstract: Video semantic segmentation is a pivotal aspect of video representation learning. However, significant domain shifts make it difficult to learn invariant spatio-temporal features across the labeled source domain and the unlabeled target domain. To address this challenge, we propose DA-STC, a novel method for domain adaptive video semantic segmentation that incorporates a bidirectional multi-level spatio-temporal fusion module and a category-aware spatio-temporal feature alignment module to facilitate consistent learning of domain-invariant features. First, we perform bidirectional spatio-temporal fusion at the image sequence level and the shallow feature level, which constructs two fused intermediate video domains. This prompts the video semantic segmentation model to consistently learn spatio-temporal features of shared patch sequences that are influenced by domain-specific contexts, thereby mitigating the feature gap between the source and target domains. Second, we propose a category-aware feature alignment module to promote the consistency of spatio-temporal features, facilitating adaptation to the target domain. Specifically, we adaptively aggregate the domain-specific deep features of each category along the spatio-temporal dimensions, and constrain the aggregated features to achieve cross-domain intra-class feature alignment and inter-class feature separation. Extensive experiments demonstrate the effectiveness of our method, which achieves state-of-the-art mIoU on multiple challenging benchmarks. Furthermore, we extend the proposed DA-STC to the image domain, where it also exhibits superior performance for domain adaptive semantic segmentation. The source code and models will be made available at https://github.com/ZHE-SAPI/DA-STC.
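
The category-aware alignment idea described in the abstract can be illustrated with a minimal sketch. This is not the DA-STC implementation: it assumes a plain per-class mean in place of the adaptive aggregation, assumes target labels are available as pseudo-labels, and uses a squared-distance pull term with a hinge-style push term chosen only for illustration. The names `class_prototypes`, `alignment_loss`, and `margin` are hypothetical.

```python
import math
from collections import defaultdict


def class_prototypes(features, labels):
    """Average the feature vectors of each class.

    features: list of equal-length float lists, one per (frame, pixel)
              position, i.e. already flattened over space and time.
    labels:   list of int class ids, same length as features.
    Assumption: a plain mean stands in for the adaptive aggregation.
    """
    sums = {}
    counts = defaultdict(int)
    for vec, cls in zip(features, labels):
        if cls not in sums:
            sums[cls] = [0.0] * len(vec)
        for i, value in enumerate(vec):
            sums[cls][i] += value
        counts[cls] += 1
    return {cls: [s / counts[cls] for s in total] for cls, total in sums.items()}


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def alignment_loss(src_protos, tgt_protos, margin=1.0):
    """Return (intra, inter) terms for cross-domain prototype alignment.

    intra: pulls same-class prototypes of both domains together.
    inter: penalizes different-class prototypes closer than `margin`.
    Both terms are illustrative choices, not the paper's loss.
    """
    shared = sorted(set(src_protos) & set(tgt_protos))
    intra = sum(sq_dist(src_protos[c], tgt_protos[c]) for c in shared)
    inter = 0.0
    for c in src_protos:
        for k in tgt_protos:
            if c != k:
                dist = math.sqrt(sq_dist(src_protos[c], tgt_protos[k]))
                inter += max(0.0, margin - dist)
    return intra, inter


# Toy data: three positions per domain, two feature channels, two classes.
src = class_prototypes([[0, 0], [0, 2], [4, 4]], [0, 0, 1])
tgt = class_prototypes([[0, 1], [4, 4], [4, 6]], [0, 1, 1])  # pseudo-labels
intra, inter = alignment_loss(src, tgt)
print(src, tgt, intra, inter)
```

In this toy case the class-0 prototypes already coincide and the class-1 prototypes differ by one unit in a single channel, so only the intra-class term is nonzero. In a real training loop both terms would be computed on deep features from the segmentation network and added to the segmentation loss.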

An Event-Driven Spatiotemporal Domain Adaptation Method for DVS Gesture Recognition

Domain Adaptive Robotic Gesture Recognition with Unsupervised Kinematic-Visual Data Alignment

Event-based Action Recognition Using Motion Information and Spiking Neural Networks

DA-STC: Domain Adaptive Video Semantic Segmentation via Spatio-Temporal Consistency

Adversary Helps: Gradient-based Device-Free Domain-Independent Gesture Recognition

Iterative Self-Training Based Domain Adaptation for Cross-User sEMG Gesture Recognition

When Unsupervised Domain Adaptation Meets Tensor Representations

DSDAN: Dual-Step Domain Adaptation Network Based on Bidirectional Knowledge Distillation for Cross-User Myoelectric Pattern Recognition

Deep Generative Domain Adaptation with Temporal Attention for Cross-User Activity Recognition

Fine-Grained Unsupervised Cross-Modality Domain Adaptation for Vestibular Schwannoma Segmentation

Temporal Attentive Alignment for Large-Scale Video Domain Adaptation

Sign Language Gesture Recognition and Classification Based on Event Camera with Spiking Neural Networks

VDM-DA: Virtual Domain Modeling for Source Data-free Domain Adaptation

Deep Generative Domain Adaptation with Temporal Relation Knowledge for Cross-User Activity Recognition

Dynamic Gesture Recognition Method Based on Improved R(2+1)D

VisDA-2021 Competition Universal Domain Adaptation to Improve Performance on Out-of-Distribution Data

Unsupervised Domain Adaptation for Device-free Gesture Recognition

GrabDAE: An Innovative Framework for Unsupervised Domain Adaptation Utilizing Grab-Mask and Denoise Auto-Encoder

Domain Adaptation with Self-Guided Adaptive Sampling Strategy: Feature Alignment for Cross-User Myoelectric Pattern Recognition

Multimodal Spatiotemporal Feature Map for Dynamic Gesture Recognition

Deep Siamese Domain Adaptation Convolutional Neural Network for Cross-domain Change Detection in Multispectral Images