Learning to Follow and Generate Instructions for Language-Capable Navigation
Jiayi Shao, Xiaohan Wang, Wenguan Wang, Yi Yang
DOI: https://doi.org/10.1109/TPAMI.2023.3341828
IF: 23.6
2024-05-01
IEEE Transactions on Pattern Analysis and Machine Intelligence
Abstract: Visual-language navigation (VLN) is a challenging task that requires embodied agents to follow natural language instructions to navigate in previously unseen environments. However, the existing literature places most of its emphasis on interpreting instructions into actions, delivering only “dumb” wayfinding agents that cannot actively use natural language to communicate with humans. In this article, we devise Lana, a language-capable navigation agent that can not only execute human-written navigation commands but also provide route descriptions to humans. This is achieved by simultaneously learning instruction following and generation with a single model. More specifically, two encoders, one for route encoding and one for language encoding, are built and shared by two decoders, one for action prediction and one for instruction generation, so as to exploit cross-task knowledge and capture task-specific characteristics. Throughout pretraining and fine-tuning, both instruction following and generation are set as optimization objectives. We further extend Lana by exploiting object semantics during route encoding. This leads to Lana+, a more powerful framework that simulates the way humans refer to landmarks for instruction composition and wayfinding. We empirically verify that, compared with recent advanced task-specific solutions, Lana attains better performance on both instruction following and generation, with nearly half the complexity. In addition, endowed with language generation capability, Lana can explain its behavior to humans and assist human wayfinding. Benefiting from landmark information, Lana+ exhibits even more impressive performance. This work is expected to foster future efforts towards building more trustworthy and socially intelligent navigation robots.
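The shared-encoder, two-decoder layout described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every function name below is hypothetical, the encoders and decoders are deliberately trivial stand-ins (mean pooling, bag-of-tokens, linear scores) rather than learned networks, and the equal weighting of the two losses and the last-word prediction setup are assumptions made only for illustration. The sketch shows one route encoding feeding both an action-prediction loss and an instruction-generation loss.

```python
import math
from typing import List, Sequence, Tuple


def cross_entropy(scores: Sequence[float], target: int) -> float:
    """Negative log-probability of `target` under softmax(scores)."""
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_norm - scores[target]


def encode_route(observations: Sequence[Sequence[float]]) -> List[float]:
    """Toy route encoder (hypothetical): mean-pool per-step visual features."""
    steps = len(observations)
    return [sum(column) / steps for column in zip(*observations)]


def encode_language(token_ids: Sequence[int], vocab_size: int) -> List[float]:
    """Toy language encoder (hypothetical): normalized bag-of-tokens vector."""
    counts = [0.0] * vocab_size
    for token in token_ids:
        counts[token] += 1.0
    total = max(len(token_ids), 1)
    return [c / total for c in counts]


def decode_action(route: Sequence[float], language: Sequence[float],
                  num_actions: int) -> List[float]:
    """Toy action decoder (hypothetical): one score per candidate action."""
    context = sum(route) + sum(language)
    return [context * (a + 1) / num_actions for a in range(num_actions)]


def decode_word(route: Sequence[float], language: Sequence[float],
                vocab_size: int) -> List[float]:
    """Toy instruction decoder (hypothetical): one score per vocabulary word."""
    context = sum(route)
    return [context / vocab_size + language[w] for w in range(vocab_size)]


def joint_loss(observations: Sequence[Sequence[float]],
               instruction: Sequence[int], next_action: int,
               num_actions: int, vocab_size: int) -> Tuple[float, float, float]:
    """Return (following loss, generation loss, their unweighted sum)."""
    # The same route encoding is shared by both decoders.
    route = encode_route(observations)
    # Following: the full instruction is visible; predict the next action.
    full = encode_language(instruction, vocab_size)
    follow = cross_entropy(decode_action(route, full, num_actions), next_action)
    # Generation: only the prefix is visible; predict the final word.
    prefix, target_word = instruction[:-1], instruction[-1]
    partial = encode_language(prefix, vocab_size)
    generate = cross_entropy(decode_word(route, partial, vocab_size), target_word)
    # Assumption: equal weighting of the two objectives.
    return follow, generate, follow + generate
```

Calling `joint_loss([[0.1, 0.2], [0.3, 0.4]], [1, 2, 3], 0, 4, 5)` returns the two per-task losses and their sum; in a real system the encoders and decoders would be trained networks and the sum would be minimized by gradient descent.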
Computer Science