Abstract: Modern deep neural networks tend to learn domain-dependent shortcuts and therefore often suffer severe performance degradation when tested on unseen target domains. This poor out-of-distribution generalization significantly limits their real-world applications. The main cause is domain shift, that is, the large distribution gap between source data and unseen target data. To address this, we take a step towards training robust models for domain-generalizable visual tasks, focusing on learning domain-invariant visual representations to alleviate domain shift. Specifically, we propose an effective Hierarchical Visual Transformation (HVT) network that (1) transforms each training sample hierarchically into new domains with diverse distributions at three levels (Global, Local, and Pixel), and (2) maximizes the visual discrepancy between the source domain and the new domains while minimizing the cross-domain feature inconsistency, so as to capture domain-invariant features. We further enhance the HVT network by introducing environment-invariant learning: we enforce invariance of the visual representation across automatically inferred environments by minimizing an invariant learning loss that considers the weighted average of the environmental losses. This prevents the model from relying on spurious features for prediction, helping it learn domain-invariant representations and narrow the domain gap in various visual matching and recognition tasks, such as stereo matching, pedestrian retrieval, and image classification. We term the extended HVT "EHVT" to distinguish it from the original. We integrate the EHVT network into different models and evaluate its effectiveness and compatibility on several public benchmark datasets. Extensive experiments show that EHVT can substantially enhance generalization performance across these tasks.
Our code is available at https://github.com/cty8998/EHVT-VisualDG.
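
As a rough illustration of two ingredients named in the abstract, the sketch below combines a weighted average of per-environment losses with a cross-domain feature inconsistency term. It is not taken from the released code: the function names, the mean-squared-difference form of the inconsistency, and the trade-off weight `lam` are all assumptions made for illustration, and the discrepancy-maximization term and the environment inference step are omitted.

```python
from typing import Sequence


def weighted_environment_loss(env_losses: Sequence[float],
                              env_weights: Sequence[float]) -> float:
    """Weighted average of per-environment losses (hypothetical helper).

    Weights are normalized by their sum, so they need not add up to 1.
    """
    if len(env_losses) != len(env_weights):
        raise ValueError("env_losses and env_weights must have the same length")
    total_weight = sum(env_weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(w * l for w, l in zip(env_weights, env_losses))
    return weighted_sum / total_weight


def feature_inconsistency(source_feat: Sequence[float],
                          transformed_feat: Sequence[float]) -> float:
    """Mean squared difference between two feature vectors.

    Assumption: the abstract does not state which distance is used;
    mean squared difference is chosen here only as a simple stand-in.
    """
    if len(source_feat) != len(transformed_feat) or not source_feat:
        raise ValueError("feature vectors must be non-empty and equal length")
    squared = [(a - b) ** 2 for a, b in zip(source_feat, transformed_feat)]
    return sum(squared) / len(squared)


def combined_objective(env_losses: Sequence[float],
                       env_weights: Sequence[float],
                       source_feat: Sequence[float],
                       transformed_feat: Sequence[float],
                       lam: float = 1.0) -> float:
    """Sum of the two terms; `lam` is a hypothetical trade-off weight."""
    invariant_term = weighted_environment_loss(env_losses, env_weights)
    consistency_term = feature_inconsistency(source_feat, transformed_feat)
    return invariant_term + lam * consistency_term
```

In a real training loop these quantities would be differentiable tensors rather than plain floats; the scalar version is only meant to make the structure of the objective concrete.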

Learning Generalized Transformation Equivariant Representations via Autoencoding Transformations

Auto-Encoding Transformations in Reparameterized Lie Groups for Unsupervised Learning

GraphTER: Unsupervised Learning of Graph Transformation Equivariant Representations via Auto-Encoding Node-Wise Transformations

AETv2: AutoEncoding Transformations for Self-Supervised Representation Learning by Minimizing Geodesic Distances in Lie Groups

Self-Supervised Multi-View Learning via Auto-Encoding 3D Transformations

VTAE: Variational Transformer Autoencoder with Manifolds Learning

$E(2)$-Equivariant Vision Transformer

Disentangling Factors of Variation in Deep Representations Using Adversarial Training

Group-based Learning of Disentangled Representations with Generalizability for Novel Contents

Target-Embedding Autoencoders for Supervised Representation Learning

Unsupervised Learning of Group Invariant and Equivariant Representations

Learning Hierarchical Visual Transformation for Domain Generalizable Visual Matching and Recognition

Experts Weights Averaging: A New General Training Scheme for Vision Transformers

SpatialFormer: Towards Generalizable Vision Transformers with Explicit Spatial Understanding

Learning Color Equivariant Representations

Uniform Transformation: Refining Latent Representation in Variational Autoencoders

Transformation GAN for Unsupervised Image Synthesis and Representation Learning

Graph Neural Networks for Learning Equivariant Representations of Neural Networks

Unsupervised Object Representation Learning using Translation and Rotation Group Equivariant VAE

Intriguing Equivalence Structures of the Embedding Space of Vision Transformers

Delving Deep into the Generalization of Vision Transformers under Distribution Shifts