Scale Decoupled Distillation

Shicai Wei Chunbo Luo Yang Luo
2024-03-20
Abstract: Logit knowledge distillation attracts increasing attention due to its practicality in recent studies. However, it often suffers inferior performance compared to the feature knowledge distillation. In this paper, we argue that existing logit-based methods may be sub-optimal since they only leverage the global logit output that couples multiple semantic knowledge. This may transfer ambiguous knowledge to the student and mislead its learning. To this end, we propose a simple but effective method, i.e., Scale Decoupled Distillation (SDD), for logit knowledge distillation. SDD decouples the global logit output into multiple local logit outputs and establishes distillation pipelines for them. This helps the student to mine and inherit fine-grained and unambiguous logit knowledge. Moreover, the decoupled knowledge can be further divided into consistent and complementary logit knowledge that transfers the semantic information and sample ambiguity, respectively. By increasing the weight of complementary parts, SDD can guide the student to focus more on ambiguous samples, improving its discrimination ability. Extensive experiments on several benchmark datasets demonstrate the effectiveness of SDD for wide teacher-student pairs, especially in the fine-grained classification task. Code is available at:
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses a limitation of traditional logit-based knowledge distillation: the global logit output fuses semantic information from several categories, which makes it hard for the student network to inherit accurate semantic information, especially for ambiguous samples. The paper points out that a whole image usually contains information about multiple categories, which can lead to classification errors. For example, two categories may belong to the same super-category and share similar global information, or a scene may contain information from multiple categories and produce a semantically mixed logit output. Because the global logit output couples this diverse, fine-grained semantic knowledge, it may transfer ambiguous knowledge to the student, mislead its learning, and lead to sub-optimal performance.
To overcome this limitation, the authors propose the Scale Decoupled Distillation (SDD) method. SDD decouples the logit output at the scale level: it decomposes the global logit output into multiple local logit outputs and establishes distillation pipelines for these local outputs, so that the student can mine and inherit fine-grained, unambiguous logit knowledge. The decoupled knowledge can be further divided into consistent and complementary logit knowledge, which transfer semantic information and sample ambiguity, respectively. By increasing the weight of the complementary part, SDD guides the student to pay more attention to ambiguous samples and improves its discrimination ability.
The main contributions of the paper are:
1. Revealing a limitation of classical logit distillation: the coupling of multi-category knowledge hinders the student from inheriting accurate semantic information for ambiguous samples.
2. Proposing a simple but effective method, SDD, for logit knowledge distillation. SDD decouples the global logit output into consistent and complementary local logit outputs and establishes distillation pipelines to mine and transfer richer and more explicit semantic knowledge.
3. Conducting extensive experiments on multiple benchmark datasets, demonstrating the effectiveness of SDD across a wide range of teacher-student pairs, especially in fine-grained classification tasks.
Through these contributions, the paper offers a way to improve logit knowledge distillation and, in particular, the student network's performance on complex and ambiguous samples.
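The decoupling described above can be illustrated with a small NumPy sketch. It is not the authors' implementation, and several details are assumptions that the summary does not specify: local logits are obtained by average-pooling a feature map over a grid of regions and applying a shared linear classifier to each region; a region counts as "consistent" when the teacher's local prediction matches its global prediction and as "complementary" otherwise; and the function names, the temperature, and the weight `beta` are hypothetical.

```python
import numpy as np

def softmax(z, t=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = z / t
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def local_logits(feat, weight, bias, scales=(1, 2)):
    """Pool a (C, H, W) feature map over s x s grids and classify each region.

    Assumes scales[0] == 1, so row 0 of the result is the global logit output,
    and that H and W are at least max(scales).
    Returns an array of shape (R, K), with R = sum(s * s for s in scales).
    """
    _, H, W = feat.shape
    rows = []
    for s in scales:
        for hi in np.array_split(np.arange(H), s):
            for wi in np.array_split(np.arange(W), s):
                region = feat[:, hi[0]:hi[-1] + 1, wi[0]:wi[-1] + 1]
                pooled = region.mean(axis=(1, 2))   # average pooling, shape (C,)
                rows.append(weight @ pooled + bias)  # shared linear classifier
    return np.stack(rows)

def kl(p, q, eps=1e-12):
    """KL(p || q) along the last axis."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def sdd_loss(t_logits, s_logits, temperature=4.0, beta=2.0):
    """Weighted sum of per-region distillation losses.

    Assumed criterion: a region is consistent if the teacher's local argmax
    equals its global argmax (row 0); other regions are complementary and
    are up-weighted by beta.
    """
    p_t = softmax(t_logits, temperature)
    p_s = softmax(s_logits, temperature)
    per_region = kl(p_t, p_s) * temperature ** 2
    consistent = t_logits.argmax(axis=-1) == t_logits[0].argmax()
    weights = np.where(consistent, 1.0, beta)
    return float(np.sum(weights * per_region))
```

With `scales=(1, 2)`, each network yields five logit vectors per image: one global and four local. Setting `beta` above 1 increases the contribution of regions where the teacher's local and global predictions disagree, which mirrors the idea of emphasizing ambiguous samples.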