MiniVLN: Efficient Vision-and-Language Navigation by Progressive Knowledge Distillation

Junyou Zhu, Yanyuan Qiao, Siqi Zhang, Xingjian He, Qi Wu, Jing Liu

2024-09-27

Abstract: In recent years, Embodied Artificial Intelligence (Embodied AI) has advanced rapidly, yet the increasing size of models conflicts with the limited computational capabilities of Embodied AI platforms. To address this challenge, we aim to achieve both high model performance and practical deployability. Specifically, we focus on Vision-and-Language Navigation (VLN), a core task in Embodied AI. This paper introduces a two-stage knowledge distillation framework, producing a student model, MiniVLN, and showcasing the significant potential of distillation techniques in developing lightweight models. The proposed method aims to capture fine-grained knowledge during the pretraining phase and navigation-specific knowledge during the fine-tuning phase. Our findings indicate that the two-stage distillation approach is more effective in narrowing the performance gap between the teacher model and the student model compared to single-stage distillation. On the public R2R and REVERIE benchmarks, MiniVLN achieves performance on par with the teacher model while having only about 12% of the teacher model's parameter count.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses how to substantially reduce the parameter count of Vision-and-Language Navigation (VLN) models while maintaining high performance, so that they can be deployed on resource-constrained devices. Although existing VLN models have made significant progress in performance, they are often computationally expensive and require large amounts of memory and processing power, which limits their use in real-time or resource-constrained scenarios. To meet this challenge, the paper proposes a two-stage knowledge distillation framework that transfers knowledge from a large teacher model to a small student model, yielding an efficient, lightweight VLN model called MiniVLN. Distillation is applied in the pretraining stage to capture fine-grained knowledge and in the fine-tuning stage to capture navigation-specific knowledge, with the goal of narrowing the performance gap between the student and the teacher. Experimental results show that, with only about 12% of the teacher model's parameters, MiniVLN achieves performance comparable to, or even better than, that of the teacher model on multiple benchmark datasets.
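
To make the idea of distillation concrete, the following is a minimal sketch of a generic soft-label distillation objective of the kind that could be applied in each of the two stages. It is not the paper's actual loss: the summary above does not specify the loss terms, so the temperature-scaled KL term, the weighting factor `alpha`, and all function names here are illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def soft_label_loss(teacher_logits, student_logits, temperature=2.0):
    """Generic soft-label term: T^2 * KL(teacher || student).

    Assumption: a standard temperature-scaled formulation, used here only
    for illustration; the paper's own terms may differ.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)

def distillation_objective(task_loss, teacher_logits, student_logits,
                           alpha=0.5, temperature=2.0):
    """Blend the stage's own task loss with the distillation term.

    `task_loss` stands for whatever the stage optimizes (a pretraining
    loss in stage one, a navigation action loss in stage two).
    `alpha` is a hypothetical weighting hyperparameter.
    """
    kd = soft_label_loss(teacher_logits, student_logits, temperature)
    return (1.0 - alpha) * task_loss + alpha * kd

# Toy example: teacher and student scores over three candidate actions.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, 0.0]
print(distillation_objective(task_loss=0.8,
                             teacher_logits=teacher,
                             student_logits=student))
```

In a two-stage setup, the same function would be called once with pretraining outputs and once with navigation action scores, so the student receives the teacher's guidance in both phases rather than only one. The distillation term is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge.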
