Abstract: Knowledge distillation in machine learning is the process of transferring knowledge from a large model, called the teacher, to a smaller model, called the student. It is one of the techniques used to compress a large network (the teacher) into a smaller network (the student) that can be deployed on small devices such as mobile phones. As the size gap between the teacher and student networks increases, the performance of the student network decreases. To address this problem, an intermediate model known as the teaching assistant is placed between the teacher and the student to bridge the gap. In this research, we show that using multiple teaching assistant models can further improve the student model (the smaller model). We combine these teaching assistant models using weighted ensemble learning, with the weight values generated by a differential evolution optimization algorithm.
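The weighted-ensemble step described in the abstract can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration and is not taken from the paper: the toy predictions, the cross-entropy objective, the DE/rand/1/bin variant, and the names `normalize`, `ensemble_loss`, `make_toy_predictions`, and `differential_evolution`. In practice a library routine such as `scipy.optimize.differential_evolution` could play the same role.

```python
import math
import random

def normalize(raw):
    """Map non-negative raw values to ensemble weights that sum to 1."""
    total = sum(raw)
    if total <= 0.0:
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]

def ensemble_loss(weights, ta_probs, labels):
    """Mean cross-entropy of the weighted average of assistant predictions.

    ta_probs[k][i][c] is assistant k's probability for class c on sample i.
    """
    total = 0.0
    for i, y in enumerate(labels):
        p_true = sum(w * probs[i][y] for w, probs in zip(weights, ta_probs))
        total -= math.log(max(p_true, 1e-12))
    return total / len(labels)

def make_toy_predictions(accuracies, n_samples=200, n_classes=3, seed=1):
    """Hypothetical stand-in for real teaching-assistant outputs."""
    rng = random.Random(seed)
    labels = [rng.randrange(n_classes) for _ in range(n_samples)]
    ta_probs = []
    for acc in accuracies:
        rows = []
        for y in labels:
            wrong = [c for c in range(n_classes) if c != y]
            top = y if rng.random() < acc else rng.choice(wrong)
            row = [0.3 / (n_classes - 1)] * n_classes
            row[top] = 0.7  # 0.7 on the predicted class, the rest spread evenly
            rows.append(row)
        ta_probs.append(rows)
    return labels, ta_probs

def differential_evolution(loss, dim, pop_size=20, generations=60,
                           f=0.8, cr=0.9, seed=0):
    """Minimal DE/rand/1/bin over raw weights in [0, 1]^dim (updated in place)."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    pop[0] = [1.0] * dim  # seed with uniform weights so the result is never worse
    scores = [loss(normalize(ind)) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees at least one mutated gene
            trial = []
            for d in range(dim):
                if rng.random() < cr or d == j_rand:
                    v = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    trial.append(min(1.0, max(0.0, v)))  # clip to the bounds
                else:
                    trial.append(pop[i][d])
            score = loss(normalize(trial))
            if score <= scores[i]:  # greedy selection keeps the better vector
                pop[i], scores[i] = trial, score
    best = min(range(pop_size), key=scores.__getitem__)
    return normalize(pop[best]), scores[best]

# Three hypothetical assistants of differing quality.
labels, ta_probs = make_toy_predictions([0.9, 0.7, 0.55])
loss = lambda w: ensemble_loss(w, ta_probs, labels)
weights, best_loss = differential_evolution(loss, dim=len(ta_probs))
uniform_loss = loss([1.0 / 3] * 3)
print("weights:", [round(w, 3) for w in weights])
print("DE loss:", round(best_loss, 4), "uniform loss:", round(uniform_loss, 4))
```

Because the initial population includes the uniform weighting and selection is greedy, the weights returned by this sketch can never score worse on the toy validation data than a plain average of the assistants.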

What problem does this paper attempt to address?

The main aim of this method is to improve the performance of the student model by using multiple teaching assistant models and combining their predictions using weighted ensemble learning.

### Experimental Results

#### Results of Traditional Knowledge Distillation

Before discussing the comparison between our proposed method and traditional knowledge distillation, we first discuss the general classification performance of the student model. The performance of the independent student model is shown in Table I, and the visual representation of the results is shown in Figure 9.

**Table I: Performance of the Independent Student Model**

| Model | Dataset | Independent Student Accuracy |
| ---- | ---- | ---- |
| CNN | CIFAR-10 | 48.01% |
| CNN | CIFAR-100 | 40.18% |
| CNN | MNIST | 85.41% |

![Figure 9: Accuracy of the Independent Student Model](fig9.png)

Next, we implement the traditional knowledge distillation architecture. Using the proposed teacher and student architectures, we implement the baseline knowledge distillation on the proposed datasets. The results are shown in Table II and Figure 10. Comparing Figure 9 and Figure 10 shows that the performance of the student model improves after traditional knowledge distillation, which is the expected result.

**Table II: Performance of the Student Model Using Baseline Knowledge Distillation**

| Model | Dataset | Traditional Knowledge Distillation (Student Accuracy) |
| ---- | ---- | ---- |
| CNN | CIFAR-10 | 49.32% |
| CNN | CIFAR-100 | 41.21% |
| CNN | MNIST | 86.59% |

![Figure 10: Teacher-Student Model Accuracy](fig10.png)

In the next stage, our goal is to use a single teaching assistant model and analyze its results in comparison with the previous two results. We measure the student performance when using a teaching assistant model between the teacher and student models. In the experiment, a 6-layer teaching assistant model is used, and the experiment is carried out on the proposed datasets. The results are shown in Table III, and the accuracy is shown in Figure 11. The results show that introducing a teaching assistant model between the teacher and student models improves the performance of the student model.

**Table III: Performance of the Student Model Using a Single Teaching Assistant Model**

| Model | Dataset | Single Teaching Assistant Model (Student Accuracy) |
| ---- | ---- | ---- |
| CNN | CIFAR-10 | 50.21% |
| CNN | CIFAR-100 | 42.08% |
| CNN | MNIST | 87.73% |

![Figure 11: Accuracy of the Student Model Using a Teaching Assistant Model](fig11.png)

#### Results of Our Proposed Method

In this part, we will discuss our proposed method, that is, the integration of teaching assistant models. The main purpose of this method is to improve the performance of the student model by using multiple teaching assistant models and combining their predictions.

**Experimental Setup**

1. **Dataset**: The experiment is carried out on the CIFAR-10, CIFAR-100 and MNIST datasets.
2. **Network Architecture**:
   - Teacher model: 10 layers
   - Student model: 2 layers
   - Teaching assistant model: 5
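Each hop in the teacher, teaching assistant, student chain evaluated above needs a distillation loss. The summary does not state which loss the paper uses, so the sketch below assumes the commonly used temperature-softened formulation (a KL term on softened outputs plus a cross-entropy term on the true label). The function names, the logits, and the `temperature` and `alpha` values are hypothetical and chosen only for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives softer targets."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(student_logits, guide_logits, label,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term.

    guide_logits come from whichever model is doing the teaching at this
    hop: the teacher when training an assistant, or an assistant (or an
    ensemble of assistants) when training the student.
    """
    soft_guide = softmax(guide_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # The temperature**2 factor keeps the soft term's gradient scale comparable.
    soft_term = temperature ** 2 * kl_divergence(soft_guide, soft_student)
    hard_term = -math.log(max(softmax(student_logits)[label], 1e-12))
    return alpha * soft_term + (1.0 - alpha) * hard_term

# Hypothetical logits for one sample whose true class is 0.
teacher_logits = [6.0, 2.0, -1.0]
assistant_logits = [4.0, 2.5, 0.0]
student_logits = [1.5, 1.0, 0.2]
label = 0

# Hop 1: the teacher guides the assistant. Hop 2: the assistant guides the student.
ta_loss = distillation_loss(assistant_logits, teacher_logits, label)
student_loss = distillation_loss(student_logits, assistant_logits, label)
print("assistant loss:", round(ta_loss, 4))
print("student loss:", round(student_loss, 4))
```

With multiple teaching assistants, the `guide_logits` argument in the second hop would be replaced by the combined output of the weighted ensemble rather than by a single assistant's output.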

Knowledge Distillation via Weighted Ensemble of Teaching Assistants
