Abstract: Deep learning has contributed greatly to many successes in artificial intelligence in recent years. Today, it is possible to train models that have thousands of layers and hundreds of billions of parameters. Large-scale deep models have achieved great success, but their enormous computational complexity and storage requirements make them extremely difficult to deploy in real-time applications. At the same time, dataset size remains a real problem in many domains: data are often missing, too expensive, or otherwise impossible to obtain. Ensemble learning partially addresses the problems of small datasets and overfitting. However, in its basic form, ensemble learning increases computational complexity linearly with the number of models. We analyzed the impact of the ensemble decision-fusion mechanism and evaluated various methods of combining decisions, including voting algorithms. We used a modified knowledge distillation framework as the decision-fusion mechanism, which additionally compresses the entire ensemble into the weight space of a single model. We showed that knowledge distillation can aggregate knowledge from multiple teachers into a single student model and, at the same computational complexity, yield a better-performing model than one trained in the standard manner. We developed our own method for mimicking the responses of all teachers simultaneously. We tested these solutions on several benchmark datasets. Finally, we presented a broad range of applications of the efficient multi-teacher knowledge distillation framework. In the first example, we used knowledge distillation to develop models that could automate corrosion detection on aircraft fuselage. The second example describes the detection of smoke in observation-camera footage to help counteract forest wildfires.
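
To make the contrast between voting and distillation-based decision fusion concrete, here is a minimal sketch in plain Python. It is not the authors' method, which the abstract does not specify. It assumes one common formulation: the student is penalized by the average KL divergence to each teacher's temperature-softened output, scaled by the squared temperature as suggested in "Distilling the Knowledge in a Neural Network". The function names, the temperature value, and the example logits are all hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over one vector of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))

def multi_teacher_kd_loss(student_logits, teacher_logits_list, temperature=4.0):
    """Average KL divergence from every teacher to the student (assumed formulation).

    All teachers contribute to the loss at once, so the student is pushed
    toward the soft outputs of the whole ensemble rather than a single model.
    """
    p_student = softmax(student_logits, temperature)
    losses = [kl_divergence(softmax(t, temperature), p_student)
              for t in teacher_logits_list]
    return temperature ** 2 * sum(losses) / len(losses)

def majority_vote(teacher_logits_list):
    """Hard-voting baseline: each teacher votes for its top class.

    Ties are broken in favor of the lowest class index.
    """
    votes = [max(range(len(t)), key=t.__getitem__) for t in teacher_logits_list]
    return max(sorted(set(votes)), key=votes.count)

# Hypothetical logits from three teachers for one three-class example.
teachers = [[2.0, 1.0, 0.1], [1.5, 1.2, 0.3], [0.2, 2.5, 0.1]]
student = [1.0, 1.0, 1.0]

print("majority vote:", majority_vote(teachers))
print("distillation loss:", multi_teacher_kd_loss(student, teachers))
```

Voting still requires running every teacher at inference time, whereas a loss of this kind is only needed during training: afterwards, only the single student model is evaluated.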

Precision-Mixed and Weight-Average Ensemble: Online Knowledge Distillation for Quantization Convolutional Neural Networks

DCCD: Reducing Neural Network Redundancy Via Distillation

Self-Paced Knowledge Distillation for Real-Time Image Guided Depth Completion

DE-MKD: Decoupled Multi-Teacher Knowledge Distillation Based on Entropy

Stochastic Precision Ensemble: Self-Knowledge Distillation for Quantized Deep Neural Networks

Improving Ensemble Distillation With Weight Averaging and Diversifying Perturbation

Learn by Oneself: Exploiting Weight-Sharing Potential in Knowledge Distillation Guided Ensemble Network

Channel Distillation: Channel-Wise Attention for Knowledge Distillation

Channel-wise Knowledge Distillation for Dense Prediction

BD-KD: Balancing the Divergences for Online Knowledge Distillation

CDFKD-MFS: Collaborative Data-free Knowledge Distillation Via Multi-level Feature Sharing

Online Knowledge Distillation via Collaborative Learning

Knowledge Probabilization in Ensemble Distillation: Improving Accuracy and Uncertainty Quantification for Object Detectors

Collaborative Multi-Teacher Knowledge Distillation for Learning Low Bit-width Deep Neural Networks

Decoupled Knowledge with Ensemble Learning for Online Distillation

Multi-teacher knowledge distillation as an effective method for compressing ensembles of neural networks

Online Knowledge Distillation via Multi-branch Diversity Enhancement

Distilling the Knowledge in a Neural Network

PQK: Model Compression via Pruning, Quantization, and Knowledge Distillation

Adaptive Cross-Architecture Mutual Knowledge Distillation