Abstract: Most logit-based knowledge distillation methods transfer soft labels from the teacher model to the student model via Kullback–Leibler divergence based on softmax, an exponential normalization function. However, the exponential nature of softmax tends to prioritize the largest class (the target class) while neglecting the smaller ones (the non-target classes), so the significance of the non-target classes is overlooked. To address this issue, we propose Non-Target-Class-Enhanced Knowledge Distillation (NTCE-KD) to amplify the role of non-target classes in terms of both magnitude and diversity. Specifically, we present a magnitude-enhanced Kullback–Leibler (MKL) divergence that multi-shrinks the target class to enhance the impact of non-target classes in terms of magnitude. Additionally, to enrich the diversity of non-target classes, we introduce a diversity-based data augmentation strategy (DDA), further enhancing overall performance. Extensive experimental results on the CIFAR-100 and ImageNet-1k datasets demonstrate that non-target classes are of great significance and that our method achieves state-of-the-art performance across a wide range of teacher–student pairs.
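To make the abstract's starting point concrete, here is a minimal sketch of the standard softmax-based KL distillation loss it refers to, written in plain Python. The logit values, the four-class setup, and the helper names `softmax` and `kl_divergence` are illustrative assumptions, not taken from the paper. It only shows how the exponential normalization concentrates almost all probability mass on the target class.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical 4-class logits (assumption); class 0 plays the target class.
teacher_logits = [8.0, 3.0, 2.0, 1.0]
student_logits = [6.0, 1.0, 3.0, 2.0]

p_teacher = softmax(teacher_logits)
p_student = softmax(student_logits)

# Probability mass on the target class versus all non-target classes.
target_mass = p_teacher[0]
non_target_mass = sum(p_teacher[1:])

# Plain logit-based distillation loss: KL(teacher || student).
kd_loss = kl_divergence(p_teacher, p_student)

print(f"target mass: {target_mass:.4f}, non-target mass: {non_target_mass:.4f}")
print(f"KD loss: {kd_loss:.4f}")
```

With these assumed logits the target class holds roughly 99% of the teacher's probability mass, which is the imbalance the abstract describes.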

What problem does this paper attempt to address?

The problem this paper attempts to solve is: **when existing logit-based knowledge distillation methods transfer knowledge, the exponential nature of the softmax function makes them focus too heavily on the target class and neglect the non-target classes**. This imbalance leaves the knowledge carried by non-target classes underused, which limits the learning and generalization of the student model.

### Specific problem description

1. **Limitations of the softmax function**:
   - Most logit-based knowledge distillation methods pass the teacher model's soft labels to the student model through the Kullback-Leibler (KL) divergence.
   - The exponential form of softmax magnifies the probability of the target class and suppresses the information in the non-target classes, so non-target-class knowledge cannot play its full role.
2. **Imbalance in the optimization process**:
   - In the conventional KL divergence, the target class usually receives a higher probability and therefore a stronger optimization gradient, while the optimization of non-target classes is neglected.
   - This imbalance may cause the student model to overfit to the target class and fail to learn the correlations between classes.
3. **Lack of data diversity**:
   - Existing methods usually process each sample from a single view and do not fully exploit the class-correlation knowledge available from different views of the same sample.
   - Without a diversity-enhancing strategy, non-target-class knowledge cannot be used comprehensively.

### Solutions proposed in the paper

To solve the above problems, the paper proposes **Non-Target-Class-Enhanced Knowledge Distillation (NTCE-KD)**, which consists of two improvements:

1. **Magnitude Enhancement**:
   - Introduce a modified KL divergence, the Magnitude-Enhanced KL (MKL) divergence, which strengthens the role of non-target classes by multi-shrinking the target-class logits of both the teacher model and the student model.
   - The shrunk target-class logit is:

     $$ \tilde{v}_n^{(y_n)} = v_n^{(y_n)} - S_n $$

     where $S_n$ is the amount by which the target-class logit of the $n$-th sample is shrunk. It is computed from

     $$ S_n^0 = v_n^{(y_n)} - \max_{k \in [1, K],\, k \neq y_n} v_n^{(k)} $$

     and the scaling coefficient $\lambda_m$ is used to further enrich the soft-label information.
2. **Diversity Enhancement**:
   - Propose a diversity-based data augmentation strategy (DDA), which increases the diversity of non-target classes by generating different views of each sample.
   - Use a gradient-independent search to find the augmentation strategy that maximizes the diversity of non-target classes:

     $$ \arg\min_{a \in A,\, b \in B} \sum_{n=1}^{N} \sum_{k=1,\, k \neq y_n}^{K} \text{Sim}\left( p(f_T(\hat{X}_n))^{(k)},\; p(f_T(X_n))^{(k)} \right) $$

     where $\text{Sim}$ is the cosine similarity between the non-target-class probabilities before and after augmentation.

### Summary

The core objective of the paper is to improve what the student model learns during knowledge distillation by enhancing both the magnitude and the diversity of non-target classes. The experimental results show that NTCE-KD outperforms existing methods on the CIFAR-100 and ImageNet-1k datasets across multiple teacher-student model combinations.
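The target-logit shrinking step described under Magnitude Enhancement can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the summary does not say how $\lambda_m$ combines with $S_n^0$, so the code assumes $S_n = \lambda \cdot S_n^0$. The function name `shrink_target_logit` and the example logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def shrink_target_logit(logits, target, scale=1.0):
    """Return a copy of `logits` with the target-class logit shrunk.

    base_shrink follows S_n^0 = v^(y_n) - max_{k != y_n} v^(k).
    ASSUMPTION: the applied shrinkage is scale * S_n^0; the source
    only states that a scaling coefficient lambda_m is involved.
    """
    non_target_max = max(v for k, v in enumerate(logits) if k != target)
    base_shrink = logits[target] - non_target_max
    shrunk = list(logits)
    shrunk[target] -= scale * base_shrink
    return shrunk

# Hypothetical 4-class logits (assumption); class 0 is the target class.
logits = [8.0, 3.0, 2.0, 1.0]
shrunk = shrink_target_logit(logits, target=0, scale=1.0)

mass_before = sum(softmax(logits)[1:])   # non-target mass, original logits
mass_after = sum(softmax(shrunk)[1:])    # non-target mass, shrunk logits

print("shrunk logits:", shrunk)
print(f"non-target mass: {mass_before:.4f} -> {mass_after:.4f}")
```

With `scale=1.0` the target logit drops to the level of the largest non-target logit, so the non-target classes carry far more of the softmax mass. Under the assumption above, smaller values of `scale` shrink it only part of the way.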
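The search criterion under Diversity Enhancement can be illustrated with a small sketch as well. It rests on one interpretation of the formula: cosine similarity is computed over the vector of non-target-class probabilities, and the candidate augmentation with the lowest similarity is chosen. The candidate names and the probability vectors are made-up placeholders; in practice they would come from the teacher's outputs on augmented inputs.

```python
import math

def non_target_cosine(p_aug, p_orig, target):
    """Cosine similarity between non-target probabilities of two views."""
    a = [v for k, v in enumerate(p_aug) if k != target]
    b = [v for k, v in enumerate(p_orig) if k != target]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical teacher probabilities for one sample (target class 0).
p_original = [0.7, 0.2, 0.05, 0.05]

# Hypothetical teacher probabilities after two candidate augmentations.
candidates = {
    "candidate_a": [0.7, 0.19, 0.06, 0.05],  # non-target pattern barely changes
    "candidate_b": [0.7, 0.05, 0.2, 0.05],   # non-target pattern changes a lot
}

# Gradient-free selection: evaluate every candidate, keep the least similar.
scores = {
    name: non_target_cosine(p_aug, p_original, target=0)
    for name, p_aug in candidates.items()
}
best = min(scores, key=scores.get)

print(scores)
print("selected:", best)
```

Minimizing similarity favors the augmentation whose non-target distribution differs most from the original view, which is the sense in which the search maximizes diversity.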

NTCE-KD: Non-Target-Class-Enhanced Knowledge Distillation

DCCD: Reducing Neural Network Redundancy Via Distillation

Improving Knowledge Distillation Via Head and Tail Categories

Rethinking Knowledge Distillation Via Cross-Entropy

Class-aware Information for Logit-based Knowledge Distillation

From Knowledge Distillation to Self-Knowledge Distillation: A Unified Approach with Normalized Loss and Customized Soft Labels

Boosting Knowledge Distillation Via Intra-class Logit Distribution Smoothing

Knowledge Condensation Distillation

Adaptive Explicit Knowledge Transfer for Knowledge Distillation

Efficient and Robust Knowledge Distillation from A Stronger Teacher Based on Correlation Matching

DE-MKD: Decoupled Multi-Teacher Knowledge Distillation Based on Entropy

Multi-target Knowledge Distillation Via Student Self-reflection

Improving Knowledge Distillation With a Customized Teacher

Adaptive Multi-Teacher Multi-level Knowledge Distillation

Channel Distillation: Channel-Wise Attention for Knowledge Distillation

Adaptive Cross-Architecture Mutual Knowledge Distillation

Collaborative Knowledge Distillation

Revisiting Knowledge Distillation Via Label Smoothing Regularization

Online Knowledge Distillation via Collaborative Learning

Knowledge Augmentation for Distillation: A General and Effective Approach to Enhance Knowledge Distillation