Distilling from Similar Tasks for Transfer Learning on a Budget

Kenneth Borup, Cheng Perng Phoo, Bharath Hariharan
2023-04-25
Abstract: We address the challenge of getting efficient yet accurate recognition systems with limited labels. While recognition models improve with model size and amount of data, many specialized applications of computer vision have severe resource constraints both during training and inference. Transfer learning is an effective solution for training with few labels, however often at the expense of a computationally costly fine-tuning of large base models. We propose to mitigate this unpleasant trade-off between compute and accuracy via semi-supervised cross-domain distillation from a set of diverse source models. Initially, we show how to use task similarity metrics to select a single suitable source model to distill from, and that a good selection process is imperative for good downstream performance of a target model. We dub this approach DistillNearest. Though effective, DistillNearest assumes a single source model matches the target task, which is not always the case. To alleviate this, we propose a weighted multi-source distillation method to distill multiple source models trained on different domains weighted by their relevance for the target task into a single efficient model (named DistillWeighted). Our methods need no access to source data, and merely need features and pseudo-labels of the source models. When the goal is accurate recognition under computational constraints, both DistillNearest and DistillWeighted approaches outperform both transfer learning from strong ImageNet initializations as well as state-of-the-art semi-supervised techniques such as FixMatch. Averaged over 8 diverse target tasks our multi-source method outperforms the baselines by 5.6%-points and 4.5%-points, respectively.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses the problem of training efficient, accurate recognition systems when both annotated data and computational resources are limited. Larger models and more data improve recognition accuracy, but in many medical applications (such as X-ray analysis) and scientific applications (such as satellite image analysis), annotated training data and the compute needed to train large models are both scarce. To avoid computationally expensive fine-tuning of large base models, the paper proposes two methods:

1. **DistillNearest**: Compute the "task similarity" between the target task and each source model, then select the single most similar source model for knowledge distillation. This method assumes that one source model matches the target task well enough for its knowledge to be distilled directly into the target model.
2. **DistillWeighted**: Because no single source model may match the target task well, this method distills from multiple source models trained on different domains. Each source model is weighted by its relevance to the target task, and the combined knowledge is distilled into a single efficient model.

Neither method requires access to the source data; both need only the features and pseudo-labels of the source models. When the goal is accurate recognition under computational constraints, both DistillNearest and DistillWeighted outperform transfer learning from a strong ImageNet initialization as well as state-of-the-art semi-supervised techniques such as FixMatch. Averaged over 8 diverse target tasks, the multi-source method outperforms these two baselines by 5.6 and 4.5 percentage points, respectively.

The main contributions of the paper include:

- Training over 200 models, demonstrating the importance of source model selection for the predictive performance of the target model.
- Finding that the task similarity metric correlates with predictive performance, so it can be used to select and weight source models for single-source or multi-source distillation without accessing any source data.
- Showing empirically that the methods outperform the compared baselines on multiple target tasks under computational and data constraints.
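
The selection and weighting steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The similarity scores are made-up placeholders for whatever task similarity metric is used, the clip-and-normalize weighting scheme is an assumption, and so are the per-source student heads and the KL-divergence distillation term. The supervised loss on the labeled target data is omitted.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def select_nearest(similarities):
    """Single-source selection: the source with the highest similarity score."""
    return max(similarities, key=similarities.get)

def relevance_weights(similarities):
    """Assumed scheme: clip negative scores to zero, then normalize to sum to 1."""
    clipped = {name: max(score, 0.0) for name, score in similarities.items()}
    total = sum(clipped.values())
    if total == 0.0:
        # No source looks relevant: fall back to uniform weights.
        return {name: 1.0 / len(clipped) for name in clipped}
    return {name: score / total for name, score in clipped.items()}

def weighted_distillation_loss(student_logits, teacher_probs, weights):
    """Weighted sum of KL(teacher || student) terms, one per source model.

    Assumes the student has one output head per source, so each head's logits
    live in the same label space as that source's pseudo-labels.
    """
    loss = 0.0
    for name, weight in weights.items():
        if weight == 0.0:
            continue  # irrelevant sources contribute nothing
        loss += weight * kl_divergence(teacher_probs[name], softmax(student_logits[name]))
    return loss

# Hypothetical similarity scores between the target task and three source models.
similarities = {"source_a": 0.75, "source_b": 0.25, "source_c": -0.5}
nearest = select_nearest(similarities)     # "source_a"
weights = relevance_weights(similarities)  # source_a: 0.75, source_b: 0.25, source_c: 0.0
```

In this sketch, single-source distillation corresponds to putting all weight on `nearest`, while the multi-source variant spreads the distillation signal across sources in proportion to their assumed relevance.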