Abstract: Audio-visual navigation of an agent towards locating an audio goal is a challenging task, especially when the audio is sporadic or the environment is noisy. In this paper, we present CAVEN, a Conversation-based Audio-Visual Embodied Navigation framework in which the agent may interact with a human/oracle for solving the task of navigating to an audio goal. Specifically, CAVEN is modeled as a budget-aware partially observable semi-Markov decision process that implicitly learns the uncertainty in the audio-based navigation policy to decide when and how the agent may interact with the oracle. Our CAVEN agent can engage in fully-bidirectional natural language conversations by producing relevant questions and interpreting free-form, potentially noisy responses from the oracle based on the audio-visual context. To enable such a capability, CAVEN is equipped with: (i) a trajectory forecasting network that is grounded in audio-visual cues to produce a potential trajectory to the estimated goal, and (ii) a natural language based question generation and reasoning network to pose an interactive question to the oracle or interpret the oracle's response to produce navigation instructions. To train the interactive modules, we present a large-scale dataset: AVN-Instruct, based on the Landmark-RxR dataset. To substantiate the usefulness of conversations, we present experiments on the benchmark audio-goal task using the SoundSpaces simulator under various noisy settings. Our results reveal that our fully-conversational approach leads to nearly an order-of-magnitude improvement in success rate, especially in localizing new sound sources and against methods that only use uni-directional interaction.

What problem does this paper attempt to address?

This paper addresses efficient audio-visual navigation in noisy environments, particularly when the audio signal is intermittent or the environmental noise is significant. The authors propose CAVEN, a Conversation-based Audio-Visual Embodied Navigation framework in which the agent completes navigation tasks by interacting with a human/oracle. CAVEN predicts candidate trajectories from audio and visual cues and generates natural language questions to ask the oracle. It then interprets the oracle's free-form, potentially noisy responses and turns them into navigation instructions.

### Main Issues

1. **Challenges of Audio-Visual Navigation**: In noisy environments, audio signals may be intermittent and mixed with other sounds, which makes navigation from audio and visual cues alone very difficult.
2. **Need for Interactive Navigation**: Existing methods mostly consider one-way interaction (i.e., the agent can only request help but cannot ask questions). The lack of two-way natural language dialogue limits the agent's effectiveness.
3. **Budget Constraints**: In practical applications the agent cannot query the oracle frequently, so it needs a mechanism for deciding when and how to interact with the oracle.

### Solutions

- **CAVEN Framework**: Combines audio, visual, and language modalities and interacts with the oracle through two-way natural language dialogue to improve navigation success rates.
- **Multimodal Navigation Strategy**: Includes an audio-visual navigation strategy, an instruction-based navigation strategy, and a two-way question-answer navigation strategy.
- **Reinforcement Learning Framework**: Uses a partially observable semi-Markov decision process (POSMDP) and a dynamic reward mechanism to train the agent to switch between navigation modes, optimizing navigation efficiency and success rate.

### Experimental Results

- Experiments on the SoundSpaces simulator show that CAVEN significantly improves navigation success rates under various noisy settings, especially when locating new sound sources.
- Compared to existing methods, CAVEN achieves nearly an order-of-magnitude improvement in success rate.

### Core Contributions

1. **Two-Way Interaction Capability**: Enables, for the first time, free-form two-way natural language dialogue between the agent and the oracle, improving communication efficiency in navigation tasks.
2. **Innovative Module Design**: Introduces a trajectory prediction module, a question generation module, and a question decoding module, which let the agent generate and interpret natural language questions.
3. **Budget-Aware Reinforcement Learning Strategy**: Designs a new budget-aware and uncertainty-segmented reinforcement learning strategy that lets the agent navigate effectively within a limited number of interactions.
4. **Large-Scale Dataset**: Proposes AVN-Instruct, a new audio-visual-language navigation sub-instruction dataset for pre-training embodied navigation models, and introduces two new evaluation metrics, SNO and SNI.

Through these innovations, CAVEN demonstrates strong navigation capabilities in complex, noisy environments and provides new directions for future embodied navigation research.
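The budget-aware switching between the three navigation modes described above can be illustrated with a minimal sketch. This is not the paper's method: CAVEN learns when to interact implicitly through reinforcement learning, whereas the sketch below substitutes hand-set uncertainty thresholds purely for illustration. The names `Mode`, `BudgetedSelector`, `ask_threshold`, and `instruct_threshold`, as well as the mapping from uncertainty level to mode, are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    """The three navigation modes named in the summary (labels are hypothetical)."""
    AUDIO_VISUAL = "audio_visual"              # navigate from audio-visual cues alone
    ASK_QUESTION = "ask_question"              # two-way question-answer with the oracle
    FOLLOW_INSTRUCTION = "follow_instruction"  # instruction-based navigation


@dataclass
class BudgetedSelector:
    """Toy stand-in for a learned high-level policy.

    Assumption: uncertainty is a scalar in [0, 1] and each oracle
    interaction consumes one unit of a fixed budget.
    """
    budget: int                       # remaining oracle interactions
    ask_threshold: float = 0.5        # hypothetical cut-off for asking a question
    instruct_threshold: float = 0.8   # hypothetical cut-off for requesting instructions

    def select(self, uncertainty: float) -> Mode:
        if not 0.0 <= uncertainty <= 1.0:
            raise ValueError("uncertainty must lie in [0, 1]")
        # With no budget left, or when the agent is confident, stay autonomous.
        if self.budget <= 0 or uncertainty < self.ask_threshold:
            return Mode.AUDIO_VISUAL
        # Any interaction with the oracle spends one unit of budget.
        self.budget -= 1
        if uncertainty >= self.instruct_threshold:
            return Mode.FOLLOW_INSTRUCTION
        return Mode.ASK_QUESTION
```

With `BudgetedSelector(budget=2)`, a low-uncertainty step keeps the agent in audio-visual mode and leaves the budget untouched, while two uncertain steps exhaust the budget, after which the selector always falls back to audio-visual navigation.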

CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments

AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments

Sound Adversarial Audio-Visual Navigation

Audio Visual Language Maps for Robot Navigation

Echo-Enhanced Embodied Visual Navigation

Learning Semantic-Agnostic and Spatial-Aware Representation for Generalizable Visual-Audio Navigation

ANAVI: Audio Noise Awareness using Visuals of Indoor environments for NAVIgation

Omnidirectional Information Gathering for Knowledge Transfer-based Audio-Visual Navigation

Multi-goal Audio-visual Navigation using Sound Direction Map

OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation

Seeing is Believing? Enhancing Vision-Language Navigation using Visual Perturbations

Object-and-Action Aware Model for Visual Language Navigation

Pay Self-Attention to Audio-Visual Navigation

Towards Versatile Embodied Navigation

Knowledge-driven Scene Priors for Semantic Audio-Visual Embodied Navigation

Active Visual Information Gathering for Vision-Language Navigation

Cog-GA: A Large Language Models-based Generative Agent for Vision-Language Navigation in Continuous Environments

Continual Vision-and-Language Navigation

Language-guided Navigation Via Cross-Modal Grounding and Alternate Adversarial Learning

RILA: Reflective and Imaginative Language Agent for Zero-Shot Semantic Audio-Visual Navigation