Abstract: We characterize the statistical efficiency of knowledge transfer through $n$ samples from a teacher to a probabilistic student classifier with input space $\mathcal S$ over labels $\mathcal A$. We show that privileged information at three progressive levels accelerates the transfer. At the first level, only samples with hard labels are known, via which the maximum likelihood estimator attains the minimax rate $\sqrt{{|{\mathcal S}||{\mathcal A}|}/{n}}$. The second level has the teacher probabilities of sampled labels available in addition, which turns out to boost the convergence rate lower bound to ${{|{\mathcal S}||{\mathcal A}|}/{n}}$. However, under this second data acquisition protocol, minimizing a naive adaptation of the cross-entropy loss results in an asymptotically biased student. We overcome this limitation and achieve the fundamental limit by using a novel empirical variant of the squared error logit loss. The third level further equips the student with the soft labels (complete logits) on ${\mathcal A}$ given every sampled input, thereby provably enabling the student to enjoy a rate ${|{\mathcal S}|}/{n}$ free of $|{\mathcal A}|$. We find any Kullback-Leibler divergence minimizer to be optimal in the last case. Numerical simulations distinguish the four learners and corroborate our theory.
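The three data acquisition levels in the abstract can be illustrated with a small tabular simulation. The sketch below is not the paper's method: the function names are hypothetical, the level-two learner is a simple plug-in that copies the revealed probabilities (it is not the squared error logit loss), and the error metric assumes that the conditional total variation is the $\rho$-weighted average of per-input total variation distances.

```python
import numpy as np

def sample_teacher(rng, num_inputs, num_labels):
    """Draw a random input distribution rho and a teacher pi_star (rows sum to 1)."""
    rho = rng.dirichlet(np.ones(num_inputs))
    pi_star = rng.dirichlet(np.ones(num_labels), size=num_inputs)
    return rho, pi_star

def draw_samples(rng, rho, pi_star, n):
    """Draw n i.i.d. pairs (s_i, a_i) with s_i ~ rho and a_i ~ pi_star[s_i]."""
    s = rng.choice(len(rho), size=n, p=rho)
    cdf = np.cumsum(pi_star, axis=1)
    u = rng.random(n)
    a = (u[:, None] > cdf[s]).sum(axis=1)          # inverse-CDF sampling per row
    return s, np.minimum(a, pi_star.shape[1] - 1)  # guard against round-off

def fit_hard(s, a, num_inputs, num_labels):
    """Level 1: empirical label frequencies (tabular MLE); uniform on unseen inputs."""
    counts = np.zeros((num_inputs, num_labels))
    np.add.at(counts, (s, a), 1.0)
    totals = counts.sum(axis=1, keepdims=True)
    uniform = np.full((num_inputs, num_labels), 1.0 / num_labels)
    return np.where(totals > 0, counts / np.maximum(totals, 1.0), uniform)

def fit_partial(s, a, r, num_inputs, num_labels):
    """Level 2 (assumed plug-in): copy revealed probabilities r[i] = pi_star[s_i, a_i]
    on seen pairs and spread the leftover mass uniformly over unseen labels."""
    pi_hat = np.zeros((num_inputs, num_labels))
    seen = np.zeros((num_inputs, num_labels), dtype=bool)
    pi_hat[s, a] = r
    seen[s, a] = True
    leftover = np.clip(1.0 - pi_hat.sum(axis=1, keepdims=True), 0.0, None)
    unseen = (~seen).sum(axis=1, keepdims=True)
    return np.where(seen, pi_hat, leftover / np.maximum(unseen, 1))

def fit_soft(s, q, num_inputs, num_labels):
    """Level 3: copy the full teacher row q[i] = pi_star[s_i] on every visited input."""
    pi_hat = np.full((num_inputs, num_labels), 1.0 / num_labels)
    pi_hat[s] = q
    return pi_hat

def conditional_tv(pi_hat, pi_star, rho):
    """Assumed metric: rho-weighted average of per-input total variation."""
    return float(rho @ (0.5 * np.abs(pi_hat - pi_star).sum(axis=1)))

rng = np.random.default_rng(0)
S, A, n = 20, 10, 2000
rho, pi_star = sample_teacher(rng, S, A)
s, a = draw_samples(rng, rho, pi_star, n)
students = {
    "hard labels": fit_hard(s, a, S, A),
    "partial soft labels": fit_partial(s, a, pi_star[s, a], S, A),
    "full soft labels": fit_soft(s, pi_star[s], S, A),
}
for name, pi_hat in students.items():
    print(f"{name}: TV = {conditional_tv(pi_hat, pi_star, rho):.4f}")
```

In this sketch the full-soft-label student is exact on every visited input, so its error can never exceed that of the plug-in partial-soft-label student; how the hard-label error compares on a single draw depends on the random seed.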

What problem does this paper attempt to address?

The core problem this paper addresses is: **characterizing the statistical efficiency of knowledge transfer over a finite domain, and how different levels of privileged (auxiliary) information affect that efficiency**. Specifically, the authors study how well a student probabilistic classifier can learn from a teacher model under three levels of auxiliary information (hard labels, partial soft labels, and full soft labels).

### Main contributions of the paper:

1. **Defined three different levels of auxiliary information**:
   - **Hard Labels**: Only the sampled inputs and their corresponding labels.
   - **Partial SLs**: In addition to the samples and labels, the teacher's probability of each sampled label.
   - **Soft Labels**: In addition to the samples and labels, the teacher's full probability distribution (i.e., complete logits) over all labels for each sampled input.
2. **Analyzed the minimax convergence rate in each case**:
   - For hard labels, the minimax convergence rate is of order $\sqrt{\frac{|\mathcal{S}||\mathcal{A}|}{n}}$, attained by the maximum likelihood estimator.
   - For partial soft labels, the minimax convergence rate is of order $\frac{|\mathcal{S}||\mathcal{A}|}{n}$, but minimizing a naive adaptation of the cross-entropy loss yields an asymptotically biased student.
   - For full soft labels, the minimax convergence rate is of order $\frac{|\mathcal{S}|}{n}$, independent of $|\mathcal{A}|$, and any Kullback-Leibler divergence minimizer is optimal.
3. **Proposed a new loss function**:
   - In the case of partial soft labels, a new empirical variant of the squared error logit loss was proposed to overcome the asymptotic bias of the naive cross-entropy loss.
4. **Theoretical and experimental verification**:
   - Theoretical analysis and numerical simulations verify the influence of the different levels of auxiliary information on the student's performance, confirming that richer auxiliary information does accelerate knowledge transfer.

### Formula summary:

- Minimax lower bound for hard labels:

\[ \inf_{\hat{\pi} \in \hat{\Pi}(D)} \sup_{\rho \times \pi^* \in \mathcal{P}} \mathbb{E}_{(\rho \times \pi^*)^n} \left[ \mathrm{TV}(\hat{\pi}, \pi^* \mid \rho) \right] \gtrsim \sqrt{\frac{|\mathcal{S}||\mathcal{A}|}{n}} \]

- Minimax lower bound for partial soft labels:

\[ \inf_{\hat{\pi} \in \hat{\Pi}(D, R)} \sup_{\rho \times \pi^* \in \mathcal{P}} \mathbb{E}_{(\rho \times \pi^*)^n} \left[ \mathrm{TV}(\hat{\pi}, \pi^* \mid \rho) \right] \gtrsim \frac{|\mathcal{S}||\mathcal{A}|}{n} \]

- Minimax lower bound for full soft labels:

\[ \inf_{\hat{\pi} \in \hat{\Pi}(D, Q)} \sup_{\rho \times \pi^* \in \mathcal{P}} \mathbb{E}_{(\rho \times \pi^*)^n} \left[ \mathrm{TV}(\hat{\pi}, \pi^* \mid \rho) \right] \gtrsim \frac{|\mathcal{S}|}{n} \]

### Conclusion:

This paper, through rigorous theoretical analysis and experimental verification, reveals different levels of auxiliary
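
To get a feel for the gap between the three rates quoted above, the following sketch evaluates the rate expressions for given problem sizes. It ignores constants and any logarithmic factors, and the function name `rate_orders` is hypothetical, so the numbers indicate orders of magnitude only.

```python
import math

def rate_orders(num_inputs, num_labels, n):
    """Evaluate the three rate expressions, ignoring constants and log factors."""
    return {
        "hard_labels": math.sqrt(num_inputs * num_labels / n),
        "partial_soft_labels": num_inputs * num_labels / n,
        "full_soft_labels": num_inputs / n,
    }

for n in (10_000, 1_000_000):
    print(n, rate_orders(100, 50, n))
```

For example, with $|\mathcal{S}| = 100$, $|\mathcal{A}| = 50$ and $n = 10^6$, the three expressions are roughly $0.071$, $0.005$ and $0.0001$ respectively.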

Towards the Fundamental Limits of Knowledge Transfer over Finite Domains

Knowledge Distillation Based on Transformed Teacher Matching

Knowledge Distillation in Wide Neural Networks: Risk Bound, Data Efficiency and Imperfect Teacher

Analysis of Knowledge Transfer in Kernel Regime

An Information-Theoretic Analysis for Transfer Learning: Error Bounds and Applications

Boosting Knowledge Distillation Via Intra-class Logit Distribution Smoothing

Exploring Dark Knowledge under Various Teacher Capacities and Addressing Capacity Mismatch

Logit Standardization in Knowledge Distillation

Respecting Transfer Gap in Knowledge Distillation

Zero-shot Knowledge Transfer via Adversarial Belief Matching

Knowledge Distillation Under Ideal Joint Classifier Assumption

Knowledge Distillation as Semiparametric Inference

Towards Understanding Knowledge Distillation

On the Generalization for Transfer Learning: An Information-Theoretic Analysis

Distilling Knowledge via Intermediate Classifiers

Asymmetric Temperature Scaling Makes Larger Networks Teach Well Again

Knowledge distillation with insufficient training data for regression

On the Efficacy of Knowledge Distillation

Faculty Distillation with Optimal Transport

Knowledge Distillation in Generations: More Tolerant Teachers Educate Better Students