Abstract: Online Knowledge Distillation (OKD) is designed to address the situation in which a high-capacity pre-trained teacher model is not available. However, existing methods mostly focus on improving the ensemble prediction accuracy of multiple students (a.k.a. branches) and often overlook the homogenization problem, which makes the student models saturate quickly and hurts performance. We hypothesize that the intrinsic bottleneck behind homogenization lies in the identical branch architectures and the coarse ensemble strategy. We propose a novel Adaptive Hierarchy-Branch Fusion framework for Online Knowledge Distillation, termed AHBF-OKD, which uses hierarchical branches and an adaptive hierarchy-branch fusion module to boost model diversity and aggregate complementary knowledge. Specifically, we first introduce hierarchical branch architectures that construct diverse peers by monotonically increasing the depth of the branches, starting from the target branch. To effectively transfer knowledge from the most complex branch to the simplest target branch, we propose an adaptive hierarchy-branch fusion module that recursively creates hierarchical teacher assistants, regarding the target branch as the smallest teacher assistant. During training, the teacher assistant from the previous hierarchy is explicitly distilled by the teacher assistant and the branch from the current hierarchy. In this way, importance scores are effectively and adaptively allocated to the different branches to reduce branch homogenization. Extensive experiments demonstrate the effectiveness of AHBF-OKD on different datasets, including CIFAR-10/100 and ImageNet 2012. For example, on ImageNet 2012, the distilled ResNet-18 achieves a Top-1 error of 29.28%, which significantly outperforms the state-of-the-art methods. The source code is available at https://github.com/linruigong965/AHBF.
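The recursive construction of teacher assistants described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the fusion rule (a softmax-weighted blend of two logit vectors), the temperature value, and all function names (`fuse`, `hierarchical_assistants`, `distillation_loss`) are assumptions made for illustration, and the importance scores are fixed numbers here rather than learned parameters. Gradient flow and stop-gradient choices are not modeled.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student) for two discrete distributions."""
    return sum(t * math.log(t / s)
               for t, s in zip(teacher_probs, student_probs) if t > 0)

def fuse(prev_assistant, branch, scores):
    """Assumed fusion rule: blend two logit vectors using
    softmax-normalized importance scores (hypothetical)."""
    w_prev, w_branch = softmax(scores)
    return [w_prev * a + w_branch * b for a, b in zip(prev_assistant, branch)]

def hierarchical_assistants(branch_logits, fusion_scores):
    """branch_logits[0] is the target (shallowest) branch and acts as the
    smallest teacher assistant; each deeper branch is fused recursively."""
    assistants = [branch_logits[0]]
    for branch, scores in zip(branch_logits[1:], fusion_scores):
        assistants.append(fuse(assistants[-1], branch, scores))
    return assistants

def distillation_loss(branch_logits, assistants, temperature=3.0):
    """At hierarchy k, the previous assistant is the student; the current
    assistant and the current branch both act as teachers."""
    loss = 0.0
    for k in range(1, len(assistants)):
        student = softmax(assistants[k - 1], temperature)
        for teacher_logits in (assistants[k], branch_logits[k]):
            teacher = softmax(teacher_logits, temperature)
            loss += kl_divergence(teacher, student)
    return loss

# Toy logits for three branches of increasing depth (made-up numbers).
branch_logits = [[2.0, 0.5, 0.1], [2.5, 0.2, 0.0], [3.0, 0.1, -0.2]]
# One pair of importance scores per fusion step (made-up numbers).
fusion_scores = [[0.0, 0.0], [0.0, 1.0]]

assistants = hierarchical_assistants(branch_logits, fusion_scores)
loss = distillation_loss(branch_logits, assistants)
```

In a real training setup the importance scores would presumably be produced by a learned module so that the weighting adapts per branch, which is the part of the method aimed at reducing homogenization; the fixed scores above only show the data flow.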

Adaptive Hierarchy-Branch Fusion for Online Knowledge Distillation

DCCD: Reducing Neural Network Redundancy Via Distillation

Attention-based Feature Interaction for Efficient Online Knowledge Distillation

Online Knowledge Distillation via Multi-branch Diversity Enhancement

Online Knowledge Distillation with Diverse Peers

Deep Cross-Layer Collaborative Learning Network for Online Knowledge Distillation

One-for-All: Bridge the Gap Between Heterogeneous Architectures in Knowledge Distillation

Diversified Branch Fusion for Self-Knowledge Distillation

Decoupled Knowledge with Ensemble Learning for Online Distillation

Adaptable Ensemble Distillation

Semi-Online Knowledge Distillation

DFEF: Diversify feature enhancement and fusion for online knowledge distillation

Adaptive multi-teacher multi-level knowledge distillation

Switchable Online Knowledge Distillation

Online Knowledge Distillation Via Collaborative Learning with Enhanced Diversity and Gradual Ensemble

Online Knowledge Distillation via Collaborative Learning

Generalization Matters: Loss Minima Flattening via Parameter Hybridization for Efficient Online Knowledge Distillation

Knowledge Distillation Using Hierarchical Self-Supervision Augmented Distribution

Peer Collaborative Learning for Online Knowledge Distillation

Hybrid mix-up contrastive knowledge distillation