Abstract: Knowledge distillation (KD), known for its ability to transfer knowledge from a cumbersome network (teacher) to a lightweight one (student) without altering the architecture, has been garnering increasing attention. Two primary categories emerge within KD methods: feature-based, focusing on intermediate layers' features, and logits-based, targeting the final layer's logits. This paper introduces a novel perspective by leveraging diverse knowledge sources within a unified KD framework. Specifically, we aggregate features from intermediate layers into a comprehensive representation, effectively gathering semantic information from different stages and scales. Subsequently, we predict the distribution parameters from this representation. These steps transform knowledge from the intermediate layers into corresponding distributive forms, thereby allowing for knowledge distillation through a unified distribution constraint at different stages of the network, ensuring the comprehensiveness and coherence of knowledge transfer. Numerous experiments were conducted to validate the effectiveness of the proposed method.

What problem does this paper attempt to address?

This paper addresses how to unify and effectively transfer knowledge from different levels of a neural network during knowledge distillation. Existing knowledge distillation methods usually focus on a single type of knowledge (feature-based or logits-based), or mix the two types directly while ignoring the inconsistency between knowledge at different levels. This leaves the optimization objective unclear and makes it difficult for the student network to reach the optimal solution. To solve this problem, the paper proposes a new framework named Unified Knowledge Distillation (UniKD). The main contributions of UniKD are:

1. **Unified knowledge distillation**: UniKD distills knowledge across different network layers in a unified way by fusing features from different levels into a comprehensive representation and converting that representation into a distribution form. This keeps knowledge transfer comprehensive and coherent.
2. **Adaptive Feature Fusion module (AFF)**: The AFF module extracts features from intermediate layers and retains multi-scale information while simplifying the computation. Through a gate mechanism, it adaptively determines the importance of adjacent-layer features, retaining key information and discarding redundant information.
3. **Feature Distribution Prediction module (FDP)**: The FDP module estimates the distribution parameters of intermediate-layer features and turns the distillation of feature knowledge into distribution-level constraints. In this way, knowledge from the intermediate layers and from the final-layer logits can be distilled consistently.
4. **Experimental verification**: The paper verifies the effectiveness of UniKD through extensive experiments on multiple datasets (such as CIFAR-100, ImageNet, and MS-COCO). The experimental results show that UniKD performs well across different tasks and network architectures, especially heterogeneous ones.

In conclusion, this paper uses the UniKD framework to resolve the inconsistency between knowledge at different levels in existing knowledge distillation methods and to achieve more efficient and coherent knowledge transfer.
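The gated fusion and distribution-level constraint summarized above can be illustrated with a small Python sketch. It is not the paper's implementation, and every specific choice is an assumption made for illustration: the gate is a per-channel sigmoid blend of two adjacent-layer feature vectors, the predicted distribution is a diagonal Gaussian produced by two linear heads, and the constraint is a teacher-to-student KL divergence. The names `gated_fuse`, `predict_gaussian`, and `gaussian_kl` are hypothetical and do not come from the summary.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(shallow, deep, gate_logits):
    """Blend two same-length feature vectors channel by channel.

    Assumption: the gate is a sigmoid weight w in (0, 1) and the fused value
    is w * shallow + (1 - w) * deep. The summary only says that a gate
    decides the importance of adjacent-layer features.
    """
    fused = []
    for s, d, g in zip(shallow, deep, gate_logits):
        w = sigmoid(g)
        fused.append(w * s + (1.0 - w) * d)
    return fused


def predict_gaussian(fused, w_mu, w_logvar):
    """Hypothetical linear heads: map the fused feature to mean and log-variance."""
    mu = [sum(w * x for w, x in zip(row, fused)) for row in w_mu]
    logvar = [sum(w * x for w, x in zip(row, fused)) for row in w_logvar]
    return mu, logvar


def gaussian_kl(mu_t, logvar_t, mu_s, logvar_s):
    """KL(teacher || student) for diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mt, lt, ms, ls in zip(mu_t, logvar_t, mu_s, logvar_s):
        total += 0.5 * (ls - lt + (math.exp(lt) + (mt - ms) ** 2) / math.exp(ls) - 1.0)
    return total


# Toy usage: all numbers below are made up for illustration only.
w_mu = [[0.5, -0.2, 0.1], [0.3, 0.4, -0.6]]
w_logvar = [[0.1, 0.0, -0.1], [0.0, 0.2, 0.1]]

teacher_fused = gated_fuse([0.2, 1.0, -0.5], [0.8, 0.1, 0.3], [1.5, -0.5, 0.0])
student_fused = gated_fuse([0.1, 0.7, -0.2], [0.6, 0.0, 0.5], [0.5, 0.5, -1.0])

mu_t, logvar_t = predict_gaussian(teacher_fused, w_mu, w_logvar)
mu_s, logvar_s = predict_gaussian(student_fused, w_mu, w_logvar)

# Distribution-level distillation term for this intermediate stage.
loss = gaussian_kl(mu_t, logvar_t, mu_s, logvar_s)
```

Expressing the intermediate-layer signal as a divergence between distributions puts it in the same form as a logits-based loss, which is the sense in which the summary calls the constraint unified.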

Harmonizing knowledge Transfer in Neural Network with Unified Distillation

Distilling Holistic Knowledge with Graph Neural Networks

Attention and feature transfer based knowledge distillation

Collaborative Knowledge Distillation Via Multiknowledge Transfer.

Towards a Unified View of Affinity-Based Knowledge Distillation

Online Knowledge Distillation via Collaborative Learning

Revisiting Knowledge Distillation: an Inheritance and Exploration Framework

Knowledge Augmentation for Distillation: A General and Effective Approach to Enhance Knowledge Distillation

Categories of Response-Based, Feature-Based, and Relation-Based Knowledge Distillation

Student-Oriented Teacher Knowledge Refinement for Knowledge Distillation

Knowledge Condensation Distillation

Respecting Transfer Gap in Knowledge Distillation

A Selective Survey on Versatile Knowledge Distillation Paradigm for Neural Network Models

Simplified Knowledge Distillation for Deep Neural Networks Bridging the Performance Gap with a Novel Teacher–Student Architecture

A Closer Look at Knowledge Distillation with Features, Logits, and Gradients

Knowledge Representing: Efficient, Sparse Representation of Prior Knowledge for Knowledge Distillation

A Unified Asymmetric Knowledge Distillation Framework for Image Classification

Channel Distillation: Channel-Wise Attention for Knowledge Distillation

Knowledge Distillation in Wide Neural Networks: Risk Bound, Data Efficiency and Imperfect Teacher

Towards Understanding and Improving Knowledge Distillation for Neural Machine Translation

Collaborative Knowledge Distillation