Pixel-Wise Contrastive Distillation

Junqiang Huang, Zichao Guo
2024-04-16
Abstract: We present a simple but effective pixel-level self-supervised distillation framework friendly to dense prediction tasks. Our method, called Pixel-Wise Contrastive Distillation (PCD), distills knowledge by attracting the corresponding pixels from the student's and teacher's output feature maps. PCD includes a novel design called SpatialAdaptor, which "reshapes" a part of the teacher network while preserving the distribution of its output features. Our ablation experiments suggest that this reshaping behavior enables more informative pixel-to-pixel distillation. Moreover, we utilize a plug-in multi-head self-attention module that explicitly relates the pixels of the student's feature maps to enhance the effective receptive field, leading to a more competitive student. PCD outperforms previous self-supervised distillation methods on various dense prediction tasks. A ResNet-18-FPN backbone distilled by PCD achieves 37.4 AP^bbox and 34.0 AP^mask on the COCO dataset with the Mask R-CNN detector. We hope our study will inspire future research on how to pre-train a small model friendly to dense prediction tasks in a self-supervised fashion.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the poor performance of small models pre-trained with self-supervised learning (SSL) on dense prediction tasks such as object detection and semantic segmentation. Specifically:

1. **Limitations of existing methods**: Current self-supervised distillation methods rely mainly on image-level guidance signals and do not fully exploit pixel-level knowledge. As a result, small models struggle to inherit the teacher's knowledge that benefits dense prediction tasks.
2. **Importance of pixel-level knowledge**: The paper proposes a new pixel-level self-supervised distillation framework, Pixel-Wise Contrastive Distillation (PCD), which improves small models on dense prediction tasks by attracting corresponding pixels in the student's and teacher's output feature maps.
3. **Design of SpatialAdaptor**: Teacher models are usually pre-trained by image-level SSL methods. The paper introduces the SpatialAdaptor, which reshapes part of the teacher network so that it can process 2D feature maps without changing the distribution of its output features.
4. **Application of the Multi-Head Self-Attention (MHSA) module**: The paper adds a plug-in MHSA module to the student model. The module explicitly relates the pixels of the student's feature maps, which enlarges the student's Effective Receptive Field (ERF) and further improves its performance.

With these components, PCD outperforms previous self-supervised distillation methods on various dense prediction tasks. For example, a ResNet-18-FPN backbone distilled by PCD reaches 37.4 AP^bbox and 34.0 AP^mask on COCO with Mask R-CNN. PCD is also reported to generalize better than existing self-supervised learning methods.
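
The pixel-to-pixel attraction described in item 2 can be illustrated with an InfoNCE-style loss computed per pixel. The following is a minimal sketch, not the authors' implementation: the function name `pixel_contrastive_loss`, the use of a list of negative features from other images, and the temperature value of 0.2 are all assumptions made for illustration. Feature maps are represented as flattened lists of per-pixel vectors so the example runs with the Python standard library alone.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def pixel_contrastive_loss(student_map, teacher_map, negatives, temperature=0.2):
    """InfoNCE-style loss averaged over pixels (illustrative sketch).

    student_map, teacher_map: lists of per-pixel feature vectors in the same
        spatial order, i.e. an H x W map flattened to H*W entries.
    negatives: feature vectors taken from other images (assumed here; the
        source does not specify how negatives are drawn).
    temperature: softmax temperature (0.2 is an assumed value).
    """
    if not student_map or len(student_map) != len(teacher_map):
        raise ValueError("maps must be non-empty and have the same number of pixels")

    negs = [l2_normalize(n) for n in negatives]
    total = 0.0
    for s, t in zip(student_map, teacher_map):
        s, t = l2_normalize(s), l2_normalize(t)
        # Positive pair: the student pixel and the teacher pixel at the same location.
        pos = dot(s, t) / temperature
        logits = [pos] + [dot(s, n) / temperature for n in negs]
        # Numerically stable log-sum-exp over the positive and all negatives.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_sum_exp - pos  # equals -log softmax(positive)
    return total / len(student_map)


# Toy 2-pixel feature maps with 2 channels each.
teacher = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[2.0, 0.0], [0.0, 3.0]]   # student pixels point the same way as the teacher's
swapped = [[0.0, 3.0], [2.0, 0.0]]   # same features, but at the wrong locations
queue = [[-1.0, 0.0], [0.0, -1.0]]   # hypothetical negatives from other images

aligned_loss = pixel_contrastive_loss(aligned, teacher, queue)
swapped_loss = pixel_contrastive_loss(swapped, teacher, queue)
print(aligned_loss < swapped_loss)  # True
```

The toy comparison shows why the signal is spatially informative: the swapped student carries the same feature vectors as the aligned one, so an image-level (pooled) comparison would treat the two alike, whereas the per-pixel loss penalizes the misplaced features.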