RILA: Reflective and Imaginative Language Agent for Zero-Shot Semantic Audio-Visual Navigation

Zeyuan Yang, LIU JIAGENG, Peihao Chen, Anoop Cherian, Tim Marks, Jonathan Le Roux, Chuang Gan
DOI: https://doi.org/10.1109/cvpr52733.2024.01538
2024-01-01
Abstract: We leverage Large Language Models (LLMs) for zero-shot Semantic Audio-Visual Navigation (SAVN). Existing methods rely on extensive training demonstrations for reinforcement learning, yet achieve relatively low success rates and lack generalizability. The intermittent nature of auditory signals poses additional obstacles to inferring the goal information. To address this challenge, we present the Reflective and Imaginative Language Agent (RILA). By employing multi-modal models to process sensory data, we instruct an LLM-based planner to actively explore the environment. During the exploration, our agent adaptively evaluates and dismisses inaccurate perceptual descriptions. Additionally, we introduce an auxiliary LLM-based assistant to enhance global environmental comprehension by mapping room layouts and providing strategic insights. Through comprehensive experiments and analysis, we show that our method outperforms relevant baselines without training demonstrations from the environment or complementary semantic information.