Abstract: As a specific form of knowledge distillation (KD), self-knowledge distillation (self-KD) enables a student network to progressively distill its own knowledge without relying on a pretrained, complex teacher network. However, recent studies of self-KD have found that the additional dark knowledge captured by auxiliary architectures or data augmentation can create better soft targets for enhancing the network, but at the cost of significantly more computation and/or parameters. Moreover, most existing self-KD methods extract the soft label as a supervisory signal from individual input samples, which overlooks knowledge about the relationships among categories. Inspired by human associative learning, we propose a simple yet effective self-KD method named associative learning for self-distillation (ALSD), which progressively distills richer knowledge about the relationships between categories across independent samples. Specifically, during distillation, the propagation of knowledge is weighted by the intersample relationships between associated samples generated in different minibatches, and these relationships are progressively estimated with the current network. In this way, our ALSD framework progressively ensembles knowledge across multiple samples using a single network, resulting in minimal computational and memory overhead compared to existing ensembling methods. Extensive experiments demonstrate that ALSD consistently boosts the classification performance of various architectures on multiple datasets. Notably, ALSD pushes self-KD performance to 80.10% on CIFAR-100, which exceeds standard backpropagation by 4.81%. Furthermore, we observe that the proposed method, which uses no pretrained teacher network, performs comparably to state-of-the-art knowledge distillation methods.
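The weighting idea described in the abstract can be illustrated with a small sketch. It is not the ALSD formulation, which the abstract does not specify. It assumes, purely for illustration, that intersample relationships are dot-product similarities between feature vectors normalized with a softmax, and that the soft target for a sample is the similarity-weighted average of the temperature-softened predictions of its associated samples. All function names, the temperature value, and the choice of similarity are hypothetical.

```python
import math


def softmax(xs, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ensembled_soft_targets(cur_feats, assoc_feats, assoc_logits, temperature=4.0):
    """Build one soft target per current sample (illustrative assumption).

    cur_feats:    features of the current minibatch
    assoc_feats:  features of associated samples from another minibatch
    assoc_logits: logits the current network produced for those samples
    """
    # Softened class distributions of the associated samples.
    assoc_probs = [softmax(l, temperature) for l in assoc_logits]
    n_classes = len(assoc_probs[0])
    targets = []
    for f in cur_feats:
        # Assumed intersample relationship: softmax over dot-product similarity.
        weights = softmax([dot(f, g) for g in assoc_feats])
        # Weighted ensemble of the associated predictions.
        targets.append([
            sum(w * p[c] for w, p in zip(weights, assoc_probs))
            for c in range(n_classes)
        ])
    return targets


def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred), used here as a generic distillation loss."""
    return sum(t * math.log((t + eps) / (p + eps)) for t, p in zip(target, pred))


if __name__ == "__main__":
    cur_feats = [[1.0, 0.0], [0.0, 1.0]]
    assoc_feats = [[0.9, 0.1], [0.2, 0.8]]
    assoc_logits = [[2.0, 0.5, 0.1], [0.1, 0.3, 2.5]]
    targets = ensembled_soft_targets(cur_feats, assoc_feats, assoc_logits)
    student = softmax([1.5, 0.4, 0.2], temperature=4.0)
    print(kl_divergence(targets[0], student))
```

Because the targets are built from outputs of the same network on other samples, this kind of ensembling needs no second model, which matches the low-overhead motivation stated in the abstract.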

Representation Distillation for Efficient Self-Supervised Learning

Using Less but Important Information for Feature Distillation

DCCD: Reducing Neural Network Redundancy Via Distillation

Adaptive Similarity Bootstrapping for Self-Distillation based Representation Learning

Beyond Self-Supervision: A Simple Yet Effective Network Distillation Alternative to Improve Backbones

Knowledge Distillation Meets Self-Supervision

Efficient Semantic Segmentation Via Self-Attention and Self-Distillation

Self-Supervised Dataset Distillation for Transfer Learning

DisCo: Remedy Self-supervised Learning on Lightweight Models with Distilled Contrastive Learning

Distilling a Powerful Student Model via Online Knowledge Distillation

DisCo: Remedying Self-supervised Learning on Lightweight Models with Distilled Contrastive Learning

Self-Knowledge Distillation via Progressive Associative Learning

Siamese Image Modeling for Self-Supervised Vision Representation Learning

Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning

Contrastive Distillation Is a Sample-Efficient Self-Supervised Loss Policy for Transfer Learning

Restructuring the Teacher and Student in Self-Distillation

Self-Distillation: Towards Efficient and Compact Neural Networks

Contrastive Representation Distillation

Self-Distillation from the Last Mini-Batch for Consistency Regularization

DCD: Discriminative and Consistent Representation Distillation