Abstract: Modern deep neural networks are prone to learning domain-dependent shortcuts and therefore often suffer severe performance degradation when tested on unseen target domains. This poor out-of-distribution generalization significantly limits their real-world application. The main cause is domain shift, that is, the large distribution gap between source data and unseen target data. This paper takes a step towards training robust models for domain-generalizable visual tasks by learning domain-invariant visual representations that alleviate the domain shift. Specifically, we propose a Hierarchical Visual Transformation (HVT) network that (1) transforms each training sample hierarchically into new domains with diverse distributions at three levels: Global, Local, and Pixel, and (2) maximizes the visual discrepancy between the source domain and the new domains while minimizing the cross-domain feature inconsistency, so as to capture domain-invariant features. We further enhance the HVT network with environment-invariant learning: we enforce invariance of the visual representation across automatically inferred environments by minimizing an invariant learning loss that considers the weighted average of the environmental losses. This prevents the model from relying on spurious features for prediction, helping it learn domain-invariant representations and narrow the domain gap in various visual matching and recognition tasks, such as stereo matching, pedestrian retrieval, and image classification. We term the extended HVT "EHVT" to distinguish it from the original. We integrate the EHVT network into different models and evaluate its effectiveness and compatibility on several public benchmark datasets. Extensive experiments show that EHVT substantially enhances generalization performance across these tasks. Our code is available at https://github.com/cty8998/EHVT-VisualDG.
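The two objectives described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how discrepancy, inconsistency, or environment weights are measured, so the mean squared gap, the trade-off factor `lam`, the hard environment labels `env_ids`, and all function names below are assumptions chosen for illustration only.

```python
from typing import Sequence


def mean_squared_gap(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared difference between two equal-length flat vectors.

    Assumption: a simple stand-in for whatever distance the method uses.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def hvt_objective(x_src, x_new, f_src, f_new, lam: float = 1.0) -> float:
    """Hypothetical combined objective to be minimized.

    x_src / x_new: flattened source image and its transformed version.
    f_src / f_new: flattened features extracted from each of them.
    Minimizing the result pushes feature inconsistency down while
    pushing visual discrepancy up (note the minus sign).
    """
    visual_discrepancy = mean_squared_gap(x_src, x_new)
    feature_inconsistency = mean_squared_gap(f_src, f_new)
    return feature_inconsistency - lam * visual_discrepancy


def weighted_environment_loss(sample_losses, env_ids, env_weights) -> float:
    """Weighted average of per-environment mean losses.

    Assumption: each sample carries a hard environment index in
    range(len(env_weights)) and every environment holds at least one sample.
    """
    totals = [0.0] * len(env_weights)
    counts = [0] * len(env_weights)
    for loss, env in zip(sample_losses, env_ids):
        totals[env] += loss
        counts[env] += 1
    if 0 in counts:
        raise ValueError("every environment needs at least one sample")
    env_losses = [t / c for t, c in zip(totals, counts)]
    weighted_sum = sum(w * l for w, l in zip(env_weights, env_losses))
    return weighted_sum / sum(env_weights)
```

In this sketch, a transformed image that looks very different from the source but yields the same features gives a low (negative) `hvt_objective`, which is the behavior the abstract describes. In practice both terms would be computed on network outputs inside a training loop rather than on plain lists.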

EfficientTrain: Exploring Generalized Curriculum Learning for Training Visual Backbones

EfficientTrain++: Generalized Curriculum Learning for Efficient Visual Backbone Training

CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images

Automated Progressive Learning for Efficient Training of Vision Transformers

Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training

Time-, Memory- and Parameter-Efficient Visual Adaptation

Spatial Transformer Networks for Curriculum Learning

Depth-Wise Convolutions in Vision Transformers for Efficient Training on Small Datasets

Efficient Training of Large Vision Models via Advanced Automated Progressive Learning

Network Expansion for Practical Training Acceleration

Effective Vision Transformer Training: A Data-Centric Perspective

Progressive Recurrent Learning for Visual Recognition

Efficient Incremental Training for Deep Convolutional Neural Networks

Simpler is Better: off-the-shelf Continual Learning Through Pretrained Backbones

Large-batch Optimization for Dense Visual Predictions

LowFormer: Hardware Efficient Design for Convolutional Transformer Backbones

Glance and Focus: a Dynamic Approach to Reducing Spatial Redundancy in Image Classification

Optimization-Inspired Learning with Architecture Augmentations and Control Mechanisms for Low-Level Vision

Learning Hierarchical Visual Transformation for Domain Generalizable Visual Matching and Recognition

Chasing Sparsity in Vision Transformers: An End-to-End Exploration

EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm