Pixel-Wise Contrastive Distillation

Junqiang Huang, Zichao Guo

2024-04-16

Abstract: We present a simple but effective pixel-level self-supervised distillation framework friendly to dense prediction tasks. Our method, called Pixel-Wise Contrastive Distillation (PCD), distills knowledge by attracting the corresponding pixels from student's and teacher's output feature maps. PCD includes a novel design called SpatialAdaptor which "reshapes" a part of the teacher network while preserving the distribution of its output features. Our ablation experiments suggest that this reshaping behavior enables more informative pixel-to-pixel distillation. Moreover, we utilize a plug-in multi-head self-attention module that explicitly relates the pixels of student's feature maps to enhance the effective receptive field, leading to a more competitive student. PCD outperforms previous self-supervised distillation methods on various dense prediction tasks. A backbone of ResNet-18-FPN distilled by PCD achieves 37.4 AP^bbox and 34.0 AP^mask on COCO dataset using the detector of Mask R-CNN. We hope our study will inspire future research on how to pre-train a small model friendly to dense prediction tasks in a self-supervised fashion.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the poor performance of small models on dense prediction tasks (such as object detection and semantic segmentation) when they are pre-trained with self-supervised learning (SSL). Specifically:

1. **Limitations of existing methods**: Current self-supervised distillation methods rely mainly on image-level guidance signals and do not fully exploit pixel-level knowledge. As a result, small models struggle to inherit the knowledge from teacher models that benefits dense prediction tasks.
2. **Importance of pixel-level knowledge**: The paper proposes a pixel-level self-supervised distillation framework, Pixel-Wise Contrastive Distillation (PCD), which improves small models on dense prediction tasks by transferring knowledge pixel by pixel: each pixel of the student's output feature map is attracted to the corresponding pixel of the teacher's.
3. **Design of the SpatialAdaptor**: Teacher models are usually pre-trained by image-level SSL methods. The SpatialAdaptor adapts part of the teacher network so that it can process 2D feature maps without changing the distribution of its output features.
4. **Application of a Multi-Head Self-Attention (MHSA) module**: The paper adds a plug-in MHSA module to the student model. It explicitly relates the pixels of the student's feature maps, enlarging the student's Effective Receptive Field (ERF) and further improving its performance.

With these components, PCD outperforms existing self-supervised distillation methods across different dense prediction tasks, and is reported to generalize better than existing self-supervised learning methods.
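The pixel-to-pixel attraction described above can be illustrated with a generic InfoNCE-style loss. This is a minimal sketch, not the authors' implementation: the function names, the temperature value of 0.2, and the choice of negatives (the other teacher pixels of the same feature map) are all assumptions made for illustration, and the paper may draw negatives differently. Feature maps are represented as flattened lists of per-pixel vectors, with student pixel `i` and teacher pixel `i` at the same spatial location.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def pixel_contrastive_loss(student, teacher, temperature=0.2):
    """Mean InfoNCE loss over pixels (illustrative sketch).

    student, teacher: lists of N per-pixel feature vectors (a flattened
    H*W feature map), where index i refers to the same location in both.
    Positive for student pixel i: teacher pixel i.
    Negatives (assumption): all other teacher pixels in the same map.
    """
    s = [l2_normalize(v) for v in student]
    t = [l2_normalize(v) for v in teacher]
    total = 0.0
    for i, s_i in enumerate(s):
        # Cosine similarity between student pixel i and every teacher pixel.
        logits = [sum(a * b for a, b in zip(s_i, t_j)) / temperature for t_j in t]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matching teacher pixel as the target class.
        total += log_denom - logits[i]
    return total / len(s)


# Toy 2x2 feature map with 2 channels, flattened to 4 pixels.
teacher_map = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
aligned_student = [list(v) for v in teacher_map]        # matches the teacher
shuffled_student = teacher_map[1:] + teacher_map[:1]    # pixels misaligned

aligned_loss = pixel_contrastive_loss(aligned_student, teacher_map)
shuffled_loss = pixel_contrastive_loss(shuffled_student, teacher_map)
print(aligned_loss < shuffled_loss)
```

In this toy setup the student whose pixels match the teacher's at every location receives a lower loss than the one whose pixels are shifted, which is the behavior a pixel-level distillation signal is meant to reward.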
