Learning CNN on ViT: A Hybrid Model to Explicitly Class-specific Boundaries for Domain Adaptation

Ba Hung Ngo,Nhat-Tuong Do-Tran,Tuan-Ngoc Nguyen,Hae-Gon Jeon,Tae Jong Choi

2024-04-26

Abstract:Most domain adaptation (DA) methods are based on either a convolutional neural networks (CNNs) or a vision transformers (ViTs). They align the distribution differences between domains as encoders without considering their unique characteristics. For instance, ViT excels in accuracy due to its superior ability to capture global representations, while CNN has an advantage in capturing local representations. This fact has led us to design a hybrid method to fully take advantage of both ViT and CNN, called Explicitly Class-specific Boundaries (ECB). ECB learns CNN on ViT to combine their distinct strengths. In particular, we leverage ViT's properties to explicitly find class-specific decision boundaries by maximizing the discrepancy between the outputs of the two classifiers to detect target samples far from the source support. In contrast, the CNN encoder clusters target features based on the previously defined class-specific boundaries by minimizing the discrepancy between the probabilities of the two classifiers. Finally, ViT and CNN mutually exchange knowledge to improve the quality of pseudo labels and reduce the knowledge discrepancies of these models. Compared to conventional DA methods, our ECB achieves superior performance, which verifies its effectiveness in this hybrid model. The project website can be found

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### The Problems This Paper Attempts to Solve This paper aims to address several key issues in the task of Domain Adaptation (DA): 1. **Combining the Advantages of ViT and CNN**: - The paper proposes a hybrid model that combines the strengths of Vision Transformers (ViT) and Convolutional Neural Networks (CNN). ViT excels at capturing global representations, while CNN is adept at capturing local features. By integrating the advantages of these two models, the approach can better handle domain adaptation tasks. 2. **Expanding Class-Specific Boundaries**: - The paper introduces a method to expand class-specific boundaries by maximizing the discrepancy between the outputs of two classifiers. This approach can help identify target samples that are far from the support of the source domain. 3. **Reducing Knowledge Discrepancy**: - To reduce the knowledge discrepancy between ViT and CNN, the paper employs a co-training strategy. In this way, the two models can exchange knowledge with each other, improving the quality of pseudo-labels and reducing the knowledge gap between the models. 4. **Improving Semi-Supervised Domain Adaptation (SSDA) Performance**: - The paper demonstrates the superior performance of its method in SSDA scenarios, especially when using a small number of labeled target samples. Compared to Unsupervised Domain Adaptation (UDA), SSDA can leverage more information, thereby improving classification accuracy. In summary, the goal of this paper is to enhance classification performance in domain adaptation tasks by combining the strengths of ViT and CNN and employing a co-training strategy.

Learning CNN on ViT: A Hybrid Model to Explicitly Class-specific Boundaries for Domain Adaptation

Trust-aware Conditional Adversarial Domain Adaptation with Feature Norm Alignment.

Dual Dynamic Consistency Regularization for Semi-Supervised Domain Adaptation

Joint Progressive Knowledge Distillation and Unsupervised Domain Adaptation

When CNN meet with ViT: decision-level feature fusion for camouflaged object detection

ViDA: Homeostatic Visual Domain Adapter for Continual Test Time Adaptation

TFC: Transformer Fused Convolution for Adversarial Domain Adaptation

A Good Student is Cooperative and Reliable: CNN-Transformer Collaborative Learning for Semantic Segmentation

DenseViT: A Hybrid CNN-Vision Transformer Model for an Improved Multisensor Lithological Classification

Model-Contrastive Federated Domain Adaptation

Cumulative Spatial Knowledge Distillation for Vision Transformers

Hybrid ViT-CNN Network for Fine-Grained Image Classification

Confidence Attention and Generalization Enhanced Distillation for Continuous Video Domain Adaptation

Scalable domain adaptation of convolutional neural networks

Supervised domain adaptation for building extraction from off-nadir aerial images

TVT: Transferable Vision Transformer for Unsupervised Domain Adaptation

Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets

Vision Transformer-based Adversarial Domain Adaptation

Convolutional Embedding Makes Hierarchical Vision Transformer Stronger

Generalized Domain Conditioned Adaptation Network

When Unsupervised Domain Adaptation Meets Tensor Representations.