Abstract: Domain generalization studies the problem of training a model with samples from several domains (or distributions) and then testing the model with samples from a new, unseen domain. In this paper, we propose a novel approach for domain generalization that leverages recent advances in large vision-language models, specifically a CLIP teacher model, to train a smaller model that generalizes to unseen domains. The key technical contribution is a new type of regularization that requires the student's learned image representations to be close to the teacher's learned text representations obtained from encoding the corresponding text descriptions of images. We introduce two designs of the loss function, absolute and relative distance, which provide specific guidance on how the training process of the student model should be regularized. We evaluate our proposed method, dubbed RISE (Regularized Invariance with Semantic Embeddings), on various benchmark datasets and show that it outperforms several state-of-the-art domain generalization methods. To our knowledge, our work is the first to leverage knowledge distillation using a large vision-language model for domain generalization. By incorporating text-based information, RISE improves the generalization capability of machine learning models.
What problem does this paper attempt to address?
This paper addresses how machine learning models trained on multiple source domains can generalize effectively to unseen target domains. Specifically, it focuses on the **Domain Generalization (DG)** problem: training a model so that it can handle data from new domains that were not present during training. To this end, the authors propose a method that uses the knowledge of a large vision-language model (CLIP) to train a smaller student model that generalizes better to unseen domains.
### Core contributions of the paper:
1. **First use of a large vision-language model as the teacher for knowledge distillation in domain generalization**: The paper uses a large vision-language model, CLIP, to guide the learning of the student model. To the authors' knowledge, this is the first work to apply this type of model to domain generalization through distillation.
2. **Introducing a regularization strategy based on text representation**: The authors propose a new regularization method, which makes the representation learned by the student model from images as close as possible to the representation learned by the teacher model from text. Since text can more concisely capture the semantic core of an image, this method helps the student model learn more generalized features.
3. **Designing two distance loss functions**:
   - **Absolute Distance Loss**: Directly pushes the image representation learned by the student model towards the text representation of the teacher model.
   - **Relative Distance Loss**: Ensures that the relative positions of the representations learned by the student model in different domains are consistent with the relative positions of the text representations of the teacher model in these domains.
4. **Extensive experimental validation**: The paper runs experiments on multiple benchmark datasets to verify the effectiveness of the proposed method, and uses ablation studies to analyze the contribution of each component.
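The two distance losses in contribution 3 can be sketched as follows. This is a minimal illustration rather than the paper's exact formulation: it assumes cosine distance, assumes that student image features and teacher text embeddings are paired one-to-one and given as plain Python lists, and uses the hypothetical function names `absolute_distance_loss` and `relative_distance_loss`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def absolute_distance_loss(student_feats, text_feats):
    """Mean cosine distance between each student image feature and its
    paired teacher text embedding (assumed form of the absolute loss)."""
    dists = [1.0 - cosine(s, t) for s, t in zip(student_feats, text_feats)]
    return sum(dists) / len(dists)


def relative_distance_loss(student_feats, text_feats):
    """Mean squared gap between pairwise similarities in the student space
    and pairwise similarities in the teacher text space (assumed form of
    the relative loss). Requires at least two samples."""
    n = len(student_feats)
    gaps = []
    for i in range(n):
        for j in range(i + 1, n):
            s_sim = cosine(student_feats[i], student_feats[j])
            t_sim = cosine(text_feats[i], text_feats[j])
            gaps.append((s_sim - t_sim) ** 2)
    return sum(gaps) / len(gaps)
```

For example, with student features `[[1.0, 0.0], [0.0, 1.0]]` and text embeddings `[[0.0, 1.0], [1.0, 0.0]]`, the absolute loss in this sketch is 1.0 because each student feature is orthogonal to its paired text embedding, while the relative loss is 0.0 because both spaces have the same pairwise geometry. This shows why the two terms give different guidance: one constrains position, the other only constrains structure.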
### Method overview:
- **Model structure**: A pre-trained, frozen CLIP model serves as the teacher; the student is a smaller image encoder.
- **Loss functions**:
  - **Standard Empirical Risk Minimization (ERM) loss**: Used for supervised learning.
  - **Model distillation loss**: Utilizes the pre-trained weights of the image encoder part of CLIP.
  - **Cross-domain (text-to-image) distance loss**: Utilizes the language encoder part of CLIP, including the absolute distance loss and the relative distance loss.
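One way the loss terms listed above might be combined is a weighted sum, sketched below. The summary does not give the exact form of each term or the weights, so everything here is an assumption: the distillation term is written as a standard KL divergence between teacher and student class distributions, and `w_distill`, `w_abs`, and `w_rel` are hypothetical weighting hyperparameters.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """ERM term: negative log-probability of the true class."""
    return -math.log(softmax(logits)[label])


def kl_distillation(student_logits, teacher_logits, temperature=1.0):
    """Assumed distillation term: KL(teacher || student) over classes."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def total_loss(erm, distill, absolute, relative,
               w_distill=1.0, w_abs=1.0, w_rel=1.0):
    """Weighted sum of the ERM, distillation, and two distance terms.
    The weights are hypothetical hyperparameters, not values from the paper."""
    return erm + w_distill * distill + w_abs * absolute + w_rel * relative
```

Setting a weight to zero removes the corresponding term, which is also how a per-component ablation could be run under this sketch.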
### Experimental results:
- **Zero-shot performance**: CLIP has strong zero-shot performance on PACS, VLCS, Office-Home, and Terra Incognita. On every dataset except Terra Incognita, it outperforms the best existing methods.
- **Comparison with existing DG methods**: With both ResNet18 and ResNet50, the proposed method significantly outperforms existing domain generalization methods; the improvement is most pronounced once the absolute and relative distance losses are added.
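For readers unfamiliar with the zero-shot setting mentioned above, the idea can be sketched as nearest-text-embedding classification: an image is assigned to the class whose text embedding is most similar to the image embedding. The sketch below uses toy vectors and a hypothetical helper `zero_shot_predict`; in practice the embeddings would come from CLIP's image and text encoders, which are not called here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_predict(image_embedding, class_text_embeddings):
    """Return the class name whose text embedding has the highest cosine
    similarity to the image embedding.

    class_text_embeddings maps a class name to the embedding of its text
    prompt (toy vectors here; an assumption standing in for CLIP outputs).
    """
    return max(
        class_text_embeddings,
        key=lambda name: cosine(image_embedding, class_text_embeddings[name]),
    )


# Toy usage: two classes with orthogonal text embeddings.
toy_classes = {"dog": [1.0, 0.0], "cat": [0.0, 1.0]}
prediction = zero_shot_predict([0.9, 0.1], toy_classes)
```

No training on the target dataset is involved, which is why the teacher's zero-shot accuracy is a useful reference point for the distilled student.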
### Ablation studies:
- **Text embedding vs. image embedding**: Using CLIP's text embedding as the supervision signal is more effective than using its image embedding, which supports the view that the text embedding carries rich semantic information.
- **Influence of each loss component**: Each loss component contributes significantly to the final performance, and combining the absolute and relative distance losses gives the best results.
In conclusion, the paper significantly improves the student model's generalization to unseen domains by introducing a knowledge distillation method based on a large vision-language model.