Abstract: We address the challenge of getting efficient yet accurate recognition systems with limited labels. While recognition models improve with model size and amount of data, many specialized applications of computer vision have severe resource constraints both during training and inference. Transfer learning is an effective solution for training with few labels, however often at the expense of a computationally costly fine-tuning of large base models. We propose to mitigate this unpleasant trade-off between compute and accuracy via semi-supervised cross-domain distillation from a set of diverse source models. Initially, we show how to use task similarity metrics to select a single suitable source model to distill from, and that a good selection process is imperative for good downstream performance of a target model. We dub this approach DistillNearest. Though effective, DistillNearest assumes a single source model matches the target task, which is not always the case. To alleviate this, we propose a weighted multi-source distillation method to distill multiple source models trained on different domains weighted by their relevance for the target task into a single efficient model (named DistillWeighted). Our methods need no access to source data, and merely need features and pseudo-labels of the source models. When the goal is accurate recognition under computational constraints, both DistillNearest and DistillWeighted approaches outperform both transfer learning from strong ImageNet initializations as well as state-of-the-art semi-supervised techniques such as FixMatch. Averaged over 8 diverse target tasks our multi-source method outperforms the baselines by 5.6%-points and 4.5%-points, respectively.

What problem does this paper attempt to address?

The paper aims to train efficient and accurate recognition systems when both annotated data and computational resources are limited. Larger models and more data improve recognition accuracy, but many medical applications (such as X-ray analysis) and scientific applications (such as satellite image analysis) have very little annotated training data and cannot afford the compute needed to train large models. To avoid computationally expensive fine-tuning of large base models, the paper proposes two methods:

1. **DistillNearest**: The "task similarity" between the target task and each source model is computed, and the single most similar source model is selected for knowledge distillation. This method assumes that one source model matches the target task well enough for its knowledge to be distilled directly into the target model.
2. **DistillWeighted**: Because a single source model may not match the target task, this method performs weighted multi-source distillation. Knowledge is distilled from multiple source models trained on different domains, each weighted by its relevance to the target task, into a single efficient model.

Neither method requires access to the source data; both need only the features and pseudo-labels of the source models. When the goal is accurate recognition under computational constraints, both DistillNearest and DistillWeighted outperform transfer learning from a strong ImageNet initialization as well as state-of-the-art semi-supervised techniques such as FixMatch. Averaged over 8 diverse target tasks, the multi-source method outperforms these two baselines by 5.6 and 4.5 percentage points, respectively.

The main contributions of the paper include:

- Training over 200 models, demonstrating the importance of source model selection for the predictive performance of the target model.
- Finding that the task similarity metric correlates with predictive performance, so it can be used to select and weight source models for single-source or multi-source distillation without accessing any source data.
- Showing that the methods outperform the compared baselines on multiple target tasks under computational and data constraints.
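The selection and weighting steps described above can be illustrated with a minimal sketch. It is not the paper's implementation: the similarity scores, the source model names, the clip-and-normalize weighting rule, and the assumption that the student has one prediction head per source model are all hypothetical choices made for illustration. The sketch only shows how a similarity score per source model could drive either a single-source choice (as in DistillNearest) or a relevance-weighted distillation loss (as in DistillWeighted).

```python
import math


def select_nearest(similarity):
    """Return the source model with the highest task-similarity score."""
    return max(similarity, key=similarity.get)


def relevance_weights(similarity, power=1.0):
    """Turn similarity scores into non-negative weights that sum to 1.

    Clipping at zero and raising to `power` are illustrative assumptions,
    not the weighting rule from the paper.
    """
    clipped = {name: max(score, 0.0) ** power for name, score in similarity.items()}
    total = sum(clipped.values())
    if total == 0.0:
        # No source looks relevant: fall back to uniform weights.
        return {name: 1.0 / len(clipped) for name in clipped}
    return {name: value / total for name, value in clipped.items()}


def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy between a teacher distribution and a student distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, predicted_probs))


def weighted_distillation_loss(weights, teacher_probs, student_probs):
    """Relevance-weighted sum of per-source distillation losses for one image.

    teacher_probs[name]: pseudo-label distribution from source model `name`.
    student_probs[name]: student prediction from the head paired with `name`
    (one head per source is an assumption of this sketch).
    """
    return sum(
        weights[name] * cross_entropy(teacher_probs[name], student_probs[name])
        for name in weights
    )


# Hypothetical similarity scores between the target task and three source models.
similarity = {"source_a": 0.62, "source_b": 0.31, "source_c": -0.05}

nearest = select_nearest(similarity)      # single-source choice
weights = relevance_weights(similarity)   # multi-source weights

# Hypothetical pseudo-labels and student predictions for one unlabeled image.
teacher_probs = {
    "source_a": [0.9, 0.1],
    "source_b": [0.2, 0.8],
    "source_c": [0.5, 0.5],
}
student_probs = {
    "source_a": [0.8, 0.2],
    "source_b": [0.3, 0.7],
    "source_c": [0.5, 0.5],
}
loss = weighted_distillation_loss(weights, teacher_probs, student_probs)
```

In this sketch a source model with a non-positive similarity score receives zero weight, so it contributes nothing to the loss; whether irrelevant sources should be dropped entirely or merely down-weighted is a design choice that would need to be checked against the paper.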

Distilling from Similar Tasks for Transfer Learning on a Budget
