Abstract: Knowledge distillation (KD) has been applied successfully to various tasks, and mainstream methods typically boost the student model via spatial imitation losses. However, the consecutive downsamplings applied in the spatial domain of the teacher model are a type of corruption, hindering the student from identifying what specific information needs to be imitated, which results in accuracy degradation. To better understand the underlying pattern of corrupted feature maps, we shift our attention to the frequency domain. During frequency distillation, we encounter a new challenge: the low-frequency bands convey general but minimal context, while the high-frequency bands are more informative but also introduce noise. Not every pixel within the frequency bands contributes equally to performance. To address this problem: (1) we propose the Frequency Prompt, plugged into the teacher model, which absorbs the semantic frequency context during finetuning; (2) during distillation, a pixel-wise frequency mask is generated via the Frequency Prompt to localize the pixels of interest (PoIs) in various frequency bands. Additionally, we employ a position-aware relational frequency loss for dense prediction tasks, delivering a high-order spatial enhancement to the student model. We dub our Frequency Knowledge Distillation method FreeKD, which determines the optimal localization and extent for the frequency distillation. Extensive experiments demonstrate that FreeKD not only consistently outperforms spatial-based distillation methods on dense prediction tasks (e.g., FreeKD brings 3.8 AP gains for RepPoints-R50 on COCO2017 and 4.55 mIoU gains for PSPNet-R18 on Cityscapes), but also confers more robustness on the student. Notably, we also validate the generalization of our approach on large-scale vision models (e.g., DINO and SAM).
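To make the idea of masked frequency-band imitation concrete, the following is a minimal sketch and not the FreeKD implementation. It assumes a single-level 2D Haar wavelet decomposition as the frequency transform (the abstract does not name one), and it takes the pixel-wise masks as a plain input, whereas in the paper they are generated via the Frequency Prompt. The names `haar_bands` and `masked_frequency_loss` are hypothetical.

```python
import numpy as np

def haar_bands(x):
    """Single-level 2D Haar transform of an (H, W) map with even H and W.

    Returns four (H/2, W/2) bands. The LH/HL naming convention varies
    between libraries; here it is only a label.
    """
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    return {
        "LL": (a + b + c + d) / 2.0,  # low-frequency band (local average)
        "LH": (a - b + c - d) / 2.0,  # detail across columns
        "HL": (a + b - c - d) / 2.0,  # detail across rows
        "HH": (a - b - c + d) / 2.0,  # diagonal detail
    }

def masked_frequency_loss(student, teacher, masks):
    """Sum over bands of the mask-weighted mean squared error.

    `masks` maps each band name to a non-negative (H/2, W/2) array that
    selects the pixels of interest in that band. Assumption: the masks
    are given; this sketch does not learn them.
    """
    s_bands = haar_bands(student)
    t_bands = haar_bands(teacher)
    total = 0.0
    for name, s in s_bands.items():
        sq_err = (s - t_bands[name]) ** 2
        weight = masks[name]
        # Normalize by the mask mass so each band contributes a weighted mean.
        total += float(np.sum(weight * sq_err) / max(np.sum(weight), 1e-8))
    return total

# Toy usage: random "feature maps" and all-ones masks (every pixel counts).
rng = np.random.default_rng(0)
teacher_map = rng.normal(size=(8, 8))
student_map = rng.normal(size=(8, 8))
ones_masks = {k: np.ones((4, 4)) for k in ("LL", "LH", "HL", "HH")}
loss = masked_frequency_loss(student_map, teacher_map, ones_masks)
```

Setting a mask entry to zero removes that frequency pixel from the loss, which is the role the abstract assigns to the pixel-wise frequency mask: restricting imitation to the informative locations in each band rather than imitating every band uniformly.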

Channel-wise Knowledge Distillation for Dense Prediction

DCCD: Reducing Neural Network Redundancy Via Distillation

Structured Knowledge Distillation for Dense Prediction

Channel-wise Distillation for Semantic Segmentation

Channel Distillation: Channel-Wise Attention for Knowledge Distillation

Online Knowledge Distillation via Collaborative Learning

Knowledge Distillation with Feature Maps for Image Classification

Generative Denoise Distillation: Simple Stochastic Noises Induce Efficient Knowledge Transfer for Dense Prediction

FreeKD: Knowledge Distillation via Semantic Frequency Prompt

Data Efficient Stagewise Knowledge Distillation

LAKD-Activation Mapping Distillation Based on Local Learning

From Knowledge Distillation to Self-Knowledge Distillation: A Unified Approach with Normalized Loss and Customized Soft Labels

Efficient and Robust Knowledge Distillation from A Stronger Teacher Based on Correlation Matching

'Parallel-Circuitized' distillation for dense object detection

FCKDNet: A Feature Condensation Knowledge Distillation Network for Semantic Segmentation

Knowledge Distillation with a Precise Teacher and Prediction with Abstention

Knowledge Distillation Via Channel Correlation Structure

ResKD: Residual-Guided Knowledge Distillation

An Embarrassingly Simple Approach for Knowledge Distillation

Spot-Adaptive Knowledge Distillation

Harmonizing knowledge Transfer in Neural Network with Unified Distillation