Knowledge distilled pre-training model for vision-language-navigation

Bo Huang, Shuai Zhang, Jitao Huang, Yijun Yu, Zhicai Shi, Yujie Xiong
DOI: https://doi.org/10.1007/s10489-022-03779-8
IF: 5.3
2022-06-30
Applied Intelligence
Abstract: Vision-language navigation (VLN) is a challenging task that requires a robot to move autonomously to a destination based on visual observations while following a human’s natural language instructions. To improve performance and generalization ability, transformer-based pre-training models are used instead of traditional methods. However, such pre-training models are poorly suited to sustainable computing and practical application because of their heavy computation and large hardware footprint. We therefore propose a lightweight pre-training model obtained through knowledge distillation. Through knowledge distillation, the rich knowledge encoded in a large “teacher” model can be transferred to a small “student” model, which greatly reduces the number of model parameters and the inference time while largely preserving the original performance. In the experiments, the model size is reduced by 87% and the average inference time by approximately 86%, so the model can be trained and run much faster. At the same time, 95% of the original model’s performance is maintained, which is still better than traditional VLN models.
computer science, artificial intelligence
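The teacher-to-student transfer described in the abstract can be illustrated with a generic soft-target distillation loss. This is a minimal sketch and not the paper's actual objective, which the abstract does not specify: it assumes the common formulation in which the student matches the teacher's temperature-softened output distribution under a KL divergence. The names `log_softmax`, `distillation_loss`, and the `temperature` parameter are hypothetical, chosen for illustration only.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    # log-sum-exp with the maximum subtracted to avoid overflow
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Assumption: the generic soft-target loss, scaled by temperature**2
    so gradient magnitudes stay comparable across temperatures. The
    paper's own loss may differ.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    return temperature ** 2 * kl
```

The loss is zero when the student reproduces the teacher's logits exactly and grows as the two softened distributions diverge. In practice a term like this is usually combined with the ordinary task loss on ground-truth labels.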
What problem does this paper attempt to address?