Real-time Vision-Language-Navigation based on a Lite Pre-training Model

Zhicai Shi, Jitao Huang, Liangqi Zhu, Guohui Zeng, Jin Liu, Bo Huang, Liyuan Ma
DOI: https://doi.org/10.1109/iThings-GreenCom-CPSCom-SmartData-Cybermatics50389.2020.00077
2020-11-01
Abstract: Vision-Language-Navigation (VLN) is a challenging task that requires a robot to move autonomously to a destination, guided by its visual observations and by natural language instructions from a human. This paper presents a lite model, based on the pre-training method, that can handle the real-time VLN task. Compared with traditional methods, our model achieves better performance and generalization by adopting pre-training. Building on the PREVALENT model, we introduce factorization and parameter sharing. These two lightweight approaches reduce the embedding parameters by 75% and the parameters of the whole model by 77%. Training time drops by about 17% and inference time by 72.2%. At the same time, the performance of the original model is maintained: the success rate (SR) and the success rate weighted by path length (SPL) are consistent with the original model on the seen validation set (Seen Val), with a slight loss of about 1%-2% on the unseen validation set (Unseen Val).
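The abstract does not spell out how factorization and parameter sharing are applied. As a rough illustration only, the sketch below assumes an ALBERT-style setup: the vocabulary-by-hidden embedding matrix is replaced by two smaller matrices, and one set of layer weights is reused across all encoder layers. The function names and all sizes are hypothetical and are not taken from the paper.

```python
def embedding_params(vocab_size, hidden_size, embed_size=None):
    """Count embedding parameters.

    Without factorization: one vocab_size x hidden_size matrix.
    With factorization (assumed ALBERT-style): a vocab_size x embed_size
    matrix followed by an embed_size x hidden_size projection.
    """
    if embed_size is None:
        return vocab_size * hidden_size
    return vocab_size * embed_size + embed_size * hidden_size


def encoder_params(per_layer_params, num_layers, share=False):
    """Count encoder parameters; with sharing, all layers reuse one set of weights."""
    return per_layer_params if share else per_layer_params * num_layers


def reduction(before, after):
    """Fraction of parameters removed, e.g. 0.75 means a 75% reduction."""
    return 1.0 - after / before


if __name__ == "__main__":
    # Hypothetical sizes chosen for illustration, not the paper's configuration.
    full = embedding_params(vocab_size=1000, hidden_size=64)
    lite = embedding_params(vocab_size=1000, hidden_size=64, embed_size=16)
    print(f"embedding reduction: {reduction(full, lite):.1%}")
```

Under these assumptions, the embedding saving grows as the factorized embedding size shrinks relative to the hidden size, while parameter sharing removes the dependence of the encoder's parameter count on its depth.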
Computer Science, Engineering