MiniVLN: Efficient Vision-and-Language Navigation by Progressive Knowledge Distillation

Junyou Zhu, Yanyuan Qiao, Siqi Zhang, Xingjian He, Qi Wu, Jing Liu
2024-09-27
Abstract: In recent years, Embodied Artificial Intelligence (Embodied AI) has advanced rapidly, yet the increasing size of models conflicts with the limited computational capabilities of Embodied AI platforms. To address this challenge, we aim to achieve both high model performance and practical deployability. Specifically, we focus on Vision-and-Language Navigation (VLN), a core task in Embodied AI. This paper introduces a two-stage knowledge distillation framework, producing a student model, MiniVLN, and showcasing the significant potential of distillation techniques in developing lightweight models. The proposed method aims to capture fine-grained knowledge during the pretraining phase and navigation-specific knowledge during the fine-tuning phase. Our findings indicate that the two-stage distillation approach is more effective in narrowing the performance gap between the teacher model and the student model compared to single-stage distillation. On the public R2R and REVERIE benchmarks, MiniVLN achieves performance on par with the teacher model while having only about 12% of the teacher model's parameter count.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper addresses is how to substantially reduce the parameter count of models for the Vision-and-Language Navigation (VLN) task while maintaining high performance, so that they can be deployed on resource-constrained devices. Although existing VLN models have made significant progress in performance, they are often computationally expensive and require large amounts of memory and processing power, which limits their use in real-time or resource-constrained scenarios. To meet this challenge, the paper proposes a two-stage knowledge distillation framework: knowledge from a large teacher model is used to train a small student model, yielding an efficient and lightweight VLN model, MiniVLN. Distillation is applied in both the pretraining stage, to capture fine-grained knowledge, and the fine-tuning stage, to capture navigation-specific knowledge, in order to narrow the performance gap between the student and the teacher. Experimental results show that, with only about 12% of the teacher model's parameters, MiniVLN achieves performance comparable to or even better than that of the teacher model on the public R2R and REVERIE benchmarks.
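To make the idea of distilling a teacher into a student concrete, the following is a minimal sketch of a generic soft-label distillation loss. It is not the paper's actual objective: the text above does not specify the losses used in either stage, so the function names, the `temperature` and `alpha` parameters, and the mix of a KL term with a cross-entropy term are all assumptions, chosen because they are a common way to set up knowledge distillation.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(student_logits, teacher_logits, target_index,
                      temperature=2.0, alpha=0.5):
    """Generic distillation loss (an assumed form, not the paper's).

    soft term: KL between temperature-softened teacher and student outputs
    hard term: cross-entropy of the student against the ground-truth action
    alpha:     weight of the soft term (hypothetical hyperparameter)
    """
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    # Scaling by temperature**2 keeps the soft term's gradient magnitude
    # comparable across temperatures, as is conventional.
    soft_loss = kl_divergence(teacher_soft, student_soft) * temperature ** 2
    hard_loss = -math.log(softmax(student_logits)[target_index] + 1e-12)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss


if __name__ == "__main__":
    # Toy example: three candidate navigation actions, the third is correct.
    teacher = [0.5, 1.0, 3.0]
    student = [1.0, 1.0, 1.5]
    print(distillation_loss(student, teacher, target_index=2))
```

In a two-stage setup of the kind described above, a loss of this general shape could be applied once during pretraining and again during fine-tuning, with different targets in each stage. Which signals are actually distilled at each stage is a detail of the paper that this sketch does not attempt to reproduce.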