Layerwised multimodal knowledge distillation for vision-language pretrained model

Jin Wang, Dawei Liao, You Zhang, Dan Xu, Xuejie Zhang
DOI: https://doi.org/10.1016/j.neunet.2024.106272
Abstract: The transformer-based model can simultaneously learn the representation for both images and text, providing excellent performance for multimodal applications. Practically, the large scale of parameters may hinder its deployment in resource-constrained devices, creating a need for model compression. To accomplish this goal, recent studies suggest using knowledge distillation to transfer knowledge from a larger trained teacher model to a small student model without any performance sacrifice. However, this only works with trained parameters of the student model by using the last layer of the teacher, which makes the student model easily overfit in the distillation procedure. Furthermore, the mutual interference between modalities causes more difficulties for distillation. To address these issues, the study proposed a layerwised multimodal knowledge distillation for a vision-language pretrained model. In addition to the last layer, the intermediate layers of the teacher were also used for knowledge transfer. To avoid interference between modalities, we split the multimodality into separate modalities and added them as extra inputs. Then, two auxiliary losses were implemented to encourage each modality to distill more effectively. Comparative experiments on four different multimodal tasks show that the proposed layerwised multimodality distillation achieves better performance than other KD methods for vision-language pretrained models.
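The abstract does not state the loss formulas, so the following is only a minimal sketch of how such an objective could be assembled, not the authors' implementation. It assumes mean squared error between matched hidden states, a temperature-scaled KL term on the last-layer logits, and two auxiliary terms computed on text-only and image-only inputs. The names `layer_map`, `aux_weight`, `temperature`, and the `"joint"`/`"text"`/`"image"` view keys are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def layerwise_loss(teacher_hidden, student_hidden, layer_map):
    """Average MSE between each student layer and its mapped teacher layer.

    layer_map[i] is the (assumed) index of the teacher layer that
    student layer i should imitate.
    """
    terms = [mse(student_hidden[i], teacher_hidden[t])
             for i, t in enumerate(layer_map)]
    return sum(terms) / len(terms)


def total_kd_loss(teacher, student, layer_map, temperature=2.0, aux_weight=0.5):
    """Joint-input loss plus two auxiliary single-modality losses.

    teacher and student map a view name ("joint", "text", "image") to a
    dict with "hidden" (list of per-layer vectors) and "logits".
    """
    def view_loss(view):
        soft = kl_divergence(
            softmax(teacher[view]["logits"], temperature),
            softmax(student[view]["logits"], temperature),
        )
        hidden = layerwise_loss(
            teacher[view]["hidden"], student[view]["hidden"], layer_map
        )
        return soft + hidden

    # Assumption: the two auxiliary losses share one weight.
    return view_loss("joint") + aux_weight * (view_loss("text") + view_loss("image"))
```

In this sketch a 2-layer student matched to a 4-layer teacher could use `layer_map = [1, 3]`, so that every second teacher layer supervises one student layer. The single-modality views correspond to feeding text or image alone as the extra inputs the abstract describes.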