Abstract: Knowledge distillation (KD) has been widely used to transfer knowledge from large, accurate models (teachers) to smaller, efficient ones (students). Recent methods have explored enforcing consistency by incorporating causal interpretations to distill invariant representations. In this work, we extend this line of research by introducing a dual augmentation strategy to promote invariant feature learning in both teacher and student models. Our approach leverages different augmentations applied to both models during distillation, pushing the student to capture robust, transferable features. This dual augmentation strategy complements invariant causal distillation by ensuring that the learned representations remain stable across a wider range of data variations and transformations. Extensive experiments on CIFAR-100 demonstrate the effectiveness of this approach, achieving competitive results in same-architecture KD.

What problem does this paper attempt to address?

This paper addresses key challenges in knowledge distillation (KD), in particular how to improve the learning efficiency and generalization ability of the student model when knowledge is transferred from a teacher. It focuses on the following aspects:

1. **Limitations of traditional KD methods**:
   - Traditional KD methods usually match the output distributions of the teacher and student models. The student may then overfit to the teacher's outputs and fail to capture deeper structured knowledge.
   - When the teacher and student architectures differ substantially, it is difficult for the student to learn robust, transferable features. Performance may decline in particular in cross-domain or out-of-distribution (OOD) scenarios.
2. **Causal invariant representation learning**:
   - To improve the student's generalization ability, the paper draws on causal reasoning, specifically on learning causally invariant representations. These representations remain stable under data variations and transformations, which improves robustness and generalization.
   - Traditional machine learning models often capture only statistical correlations in the data and ignore the underlying causal mechanisms, so they perform poorly under distribution shift. Causal representation learning helps the model cope with such shift by disentangling factors such as content and style.
3. **Application of the dual augmentation strategy**:
   - To address these problems, the paper proposes a dual augmentation strategy: different data augmentations are applied to the teacher and to the student. This strengthens the invariance of the representations the student learns and improves its stability under a variety of data transformations.
   - Because the two models see differently augmented inputs, the student is forced to learn robust, transferable features that stay consistent across a wide range of transformations.

### Main contributions of the paper

- **Introduction of the dual augmentation strategy**: Applying different data augmentations to the teacher and student models encourages the student to learn more robust and general features.
- **Combination with causal reasoning**: Principles of causal reasoning are used so that the features learned by the student remain invariant under different data conditions and transformations, which improves generalization.
- **Experimental verification**: Experiments on benchmark datasets (such as CIFAR-100, TIN-200, and STL-10) are reported to show competitive performance in the same-architecture KD setting and good generalization on OOD tasks.

In summary, the paper addresses the weaknesses of traditional KD methods in learning robust, general features by combining the dual augmentation strategy with causal reasoning, and it improves the performance of the student model across tasks and data distributions.
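The dual augmentation idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the two augmentations (`augment_teacher`, `augment_student`), the linear stand-in models, and the temperature value are all hypothetical, and the loss is the generic temperature-softened KL divergence from standard KD rather than the invariant objective the paper itself uses.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Generic KD loss: KL(teacher || student) on softened outputs, scaled by T^2."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    return float(temperature ** 2 * kl.mean())

def augment_teacher(images, rng):
    """Hypothetical teacher-side augmentation: random horizontal flip."""
    flip = rng.random(images.shape[0]) < 0.5
    out = images.copy()
    out[flip] = out[flip][:, :, ::-1]
    return out

def augment_student(images, rng, noise_std=0.1):
    """Hypothetical student-side augmentation: additive Gaussian noise."""
    return images + rng.normal(0.0, noise_std, size=images.shape)

def linear_model(images, weights):
    """Stand-in for a network: flatten each image and apply one linear layer."""
    return images.reshape(images.shape[0], -1) @ weights

def dual_augmentation_kd_step(images, teacher_w, student_w, rng, temperature=4.0):
    """Compute the KD loss when teacher and student see different views."""
    teacher_view = augment_teacher(images, rng)   # view seen only by the teacher
    student_view = augment_student(images, rng)   # different view for the student
    t_logits = linear_model(teacher_view, teacher_w)
    s_logits = linear_model(student_view, student_w)
    return kd_loss(s_logits, t_logits, temperature)

# Toy data: 8 grayscale 4x4 "images", 10 classes (all values are assumptions).
rng = np.random.default_rng(0)
images = rng.normal(size=(8, 4, 4))
teacher_w = rng.normal(size=(16, 10))
student_w = rng.normal(size=(16, 10))
loss = dual_augmentation_kd_step(images, teacher_w, student_w, rng)
print(f"dual-augmentation KD loss: {loss:.4f}")
```

In a real training loop the student weights would be updated to reduce this loss while the teacher stays fixed. Because the two views differ, the student can only match the teacher by relying on features that survive both transformations, which is the intuition behind the invariance claim.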

Distilling Invariant Representations with Dual Augmentation

Knowledge Augmentation for Distillation: A General and Effective Approach to Enhance Knowledge Distillation

Role-Wise Data Augmentation for Knowledge Distillation

HARD: Hard Augmentations for Robust Distillation

Revisiting Knowledge Distillation: an Inheritance and Exploration Framework

TAS: Distilling Arbitrary Teacher and Student via a Hybrid Assistant

Relational Representation Distillation

An Empirical Analysis of the Impact of Data Augmentation on Knowledge Distillation

Improving Knowledge Distillation with Teacher's Explanation

DCD: Discriminative and Consistent Representation Distillation

Understanding the Effect of Data Augmentation on Knowledge Distillation

Comparative Knowledge Distillation

Isotonic Data Augmentation for Knowledge Distillation

Small Scale Data-Free Knowledge Distillation

Wasserstein Contrastive Representation Distillation

Dual teachers for self-knowledge distillation

Learning Interpretation with Explainable Knowledge Distillation

Faithful Knowledge Distillation

Towards Effective Data-Free Knowledge Distillation via Diverse Diffusion Augmentation

Avatar Knowledge Distillation: Self-ensemble Teacher Paradigm with Uncertainty

An Embarrassingly Simple Approach for Knowledge Distillation