CrossKD: Cross-Head Knowledge Distillation for Object Detection

Jiabao Wang, Yuming Chen, Zhaohui Zheng, Xiang Li, Ming-Ming Cheng, Qibin Hou
2024-04-15
Abstract: Knowledge Distillation (KD) has been validated as an effective model compression technique for learning compact object detectors. Existing state-of-the-art KD methods for object detection are mostly based on feature imitation. In this paper, we present a general and effective prediction mimicking distillation scheme, called CrossKD, which delivers the intermediate features of the student's detection head to the teacher's detection head. The resulting cross-head predictions are then forced to mimic the teacher's predictions. This manner relieves the student's head from receiving contradictory supervision signals from the annotations and the teacher's predictions, greatly improving the student's detection performance. Moreover, as mimicking the teacher's predictions is the target of KD, CrossKD offers more task-oriented information in contrast with feature imitation. On MS COCO, with only prediction mimicking losses applied, our CrossKD boosts the average precision of GFL ResNet-50 with 1x training schedule from 40.2 to 43.7, outperforming all existing KD methods. In addition, our method also works well when distilling detectors with heterogeneous backbones. Code is available at https://github.com/jbwang1997/CrossKD.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the target conflict problem that arises in knowledge distillation (KD) for object detection. In conventional prediction mimicking, the student's predictions are supervised both by the ground-truth targets and by the teacher's predictions. Because the targets produced by the student's target assigner can be inconsistent with the teacher's predictions, the student receives contradictory supervision signals, which hampers optimization and hurts final performance. To solve this, the paper proposes Cross-Head Knowledge Distillation (CrossKD). CrossKD passes intermediate features from the student's detection head into the teacher's detection head to produce cross-head predictions, and then forces these cross-head predictions to mimic the teacher's predictions. The student's own predictions are therefore supervised only by the ground-truth targets, while the distillation signal reaches the student through the cross-head path. This alleviates the target conflict, makes prediction mimicking more effective, and provides more task-oriented information than feature imitation. For example, on the MS COCO dataset, using only prediction mimicking losses, CrossKD raises the Average Precision (AP) of GFL ResNet-50 from 40.2 to 43.7, surpassing all existing KD methods. Experiments also show that CrossKD is orthogonal to feature imitation and can be combined with it to further improve performance.
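The cross-head idea described above can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the authors' implementation: each detection head is modeled as a list of dense weight matrices instead of convolutions, both heads are assumed to have the same depth and hidden width, and a softmax KL divergence stands in for the prediction mimicking loss. The names `forward`, `crosskd_loss`, and `split` are hypothetical, and the detection loss against the ground truth is omitted.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(layers, x, start=0, stop=None):
    """Apply layers[start:stop]; ReLU follows every layer except the final one."""
    stop = len(layers) if stop is None else stop
    for k in range(start, stop):
        x = x @ layers[k]
        if k < len(layers) - 1:
            x = relu(x)
    return x

def kl_div(p, q, eps=1e-12):
    """Mean over rows of KL(p || q)."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def crosskd_loss(student, teacher, x_s, x_t, split):
    """Toy cross-head distillation (assumes equal head depth and hidden width).

    student, teacher: lists of weight matrices modeling the two heads.
    x_s, x_t: head inputs (e.g. neck features) of the student and the teacher.
    split: number of student layers applied before switching to the teacher head.
    """
    student_mid = forward(student, x_s, 0, split)    # student intermediate feature
    student_pred = forward(student, student_mid, split)  # for the ground-truth loss only
    cross_pred = forward(teacher, student_mid, split)    # rest of the (frozen) teacher head
    teacher_pred = forward(teacher, x_t)                 # teacher's own prediction
    # Only the cross-head prediction is asked to mimic the teacher.
    kd = kl_div(softmax(teacher_pred), softmax(cross_pred))
    return student_pred, kd

# Small demo with random weights: 3 hidden layers of width 8, 4 output classes.
rng = np.random.default_rng(0)
make_head = lambda: [rng.normal(size=(8, 8)) * 0.3 for _ in range(3)] + [rng.normal(size=(8, 4)) * 0.3]
teacher_head, student_head = make_head(), make_head()
x_student, x_teacher = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
pred, kd = crosskd_loss(student_head, teacher_head, x_student, x_teacher, split=2)
print(pred.shape, kd)
```

In a real training loop the teacher's weights would be frozen, so the gradient of the distillation term would flow back through the teacher's later layers into the student's earlier head layers and backbone, while the student's later head layers would be trained only by the detection loss on `student_pred`.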