Enabling Vision-and-Language Navigation for Intelligent Connected Vehicles Using Large Pre-Trained Models

Yaqi Hu,Dongyuan Ou,Xiaoxu Wang,Rong Yu
DOI: https://doi.org/10.1109/ithings-greencom-cpscom-smartdata-cybermatics60724.2023.00083
2024-01-01
Abstract:In the field of autonomous driving, Visual-and-Language Navigation (VLN) is a typical multimodal task. In the VLN task, an intelligent vehicle needs to find the target location based on user-provided navigation instructions. However, conventional VLN models generally face the problem of limited generalization ability when dealing with a large number of real-world environmental objects and language instructions. This paper proposed a novel VLN system based on large-scale pre-trained models and applied it to intelligent vehicles. The method consists of an Instruction Extraction System, a Vision-Language Association System, and a Navigational Decisions System. Specifically, a pre-trained Large Language Model (LLM) is first used to extract a series of landmark names from the user’s natural language instructions. Then, the landmark name list is inputted into a pre-trained Visual-Language Model (VLM) to infer the joint probability with environmental objects. Therefore, the image nodes that match the landmarks have been selected. Additionally, the selected image nodes are inputted into another VLM to obtain descriptions of the image nodes. Finally, LLM is used to reason navigation actions for the intelligent vehicle. With the reasoning ability of LLM, the intelligent vehicle takes navigation knowledge, visual environment descriptions, and navigation history as inputs to output navigation actions. The simulation results demonstrate that compared to other fully supervised learning methods, this approach exhibits better generalization ability in unknown environments. Based on Google Map’s Street View data, it achieves a 14.6% higher task success rate compared to the baseline model VLN Transformer.
What problem does this paper attempt to address?