Abstract: Building a general-purpose intelligent home-assistant agent skilled in diverse tasks by human commands is a long-term blueprint of embodied AI research, which poses requirements on task planning, environment modeling, and object interaction. In this work, we study primitive mobile manipulations for embodied agents, i.e., how to navigate and interact based on an instructed verb-noun pair. We propose DISCO, which features non-trivial advancements in contextualized scene modeling and efficient controls. In particular, DISCO incorporates differentiable scene representations of rich semantics in object and affordance, which are dynamically learned on the fly and facilitate navigation planning. Besides, we propose dual-level coarse-to-fine action controls leveraging both global and local cues to accomplish mobile manipulation tasks efficiently. DISCO easily integrates into embodied tasks such as embodied instruction following. To validate our approach, we take the ALFRED benchmark of large-scale long-horizon vision-language navigation and interaction tasks as a test bed. In extensive experiments, we make comprehensive evaluations and demonstrate that DISCO outperforms the state of the art by a sizable +8.6% success rate margin in unseen scenes, even without step-by-step instructions. Our code is publicly released at https://github.com/AllenXuuu/DISCO.

What problem does this paper attempt to address?

The problem this paper addresses is how to build an embodied AI agent, in the role of an intelligent home assistant, that can carry out long-horizon household tasks in complex, unstructured environments by navigating and interacting according to human instructions. Specifically, the research focuses on primitive mobile manipulation: how to navigate and interact given an instructed verb-noun pair.

### Core Problems of the Paper

1. **Task Planning**: How to parse human instructions and formulate corresponding plans.
2. **Environment Modeling**: How to perceive the surrounding environment, locate semantic entities, and navigate to target locations.
3. **Object Interaction**: How to interact with objects effectively, especially in the absence of step-by-step instructions.

### DISCO's Solutions

To address these problems, the paper proposes DISCO (DIfferentiable Scene Semantics and Dual-level COntrol). Its main contributions are:

1. **Differentiable Scene Semantic Representation**:
   - DISCO learns dynamic scene representations that carry rich object and affordance semantics and are updated on the fly.
   - The scene representations are optimized by gradient descent to match local point-cloud semantics.
   - The prediction for each grid cell is \( p_{i,j} = f(s_i, q_j) = \sigma(s_i^\top q_j) \), where \( s_i \) is the scene representation of the \( i \)-th grid cell, \( q_j \) is the \( j \)-th semantic query vector, and \( \sigma \) is the sigmoid function.
2. **Dual-level Coarse-to-Fine Control**:
   - **Coarse Control**: Navigates on global cues (the semantic scene map) to bring the agent close to the target object.
   - **Fine Control**: Uses neural policies on local visual frames (RGB images, depth estimates, and object masks) to adjust the agent's pose and operate on the object efficiently.
3. **Application on the ALFRED Benchmark**:
   - Extensive experiments on the ALFRED benchmark show that DISCO outperforms existing methods in unseen scenes, even without step-by-step instructions.
   - In unseen scenes, DISCO exceeds the best existing method by a +8.6% success rate margin.

### Summary

By introducing a differentiable scene semantic representation and a dual-level control mechanism, DISCO addresses the challenges of navigating and interacting according to human instructions in complex environments, and it performs particularly well when step-by-step guidance is absent. This offers new ideas and technical means for building general-purpose home-assistant agents.
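
The scene-representation formula above can be made concrete with a small sketch. This is not the authors' implementation: it assumes fixed query vectors, a binary cross-entropy loss, and plain gradient descent on a single grid-cell vector, none of which the summary specifies. The names `predict`, `update_cell`, and `coarse_target` are hypothetical and introduced only for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(s, q):
    """p = sigmoid(s^T q) for one grid-cell vector s and one query vector q."""
    return sigmoid(sum(a * b for a, b in zip(s, q)))


def update_cell(s, queries, targets, lr=0.5, steps=200):
    """Fit one cell vector s so that predict(s, q_j) approaches target y_j.

    Assumption: the loss is binary cross-entropy, whose gradient with
    respect to s is the sum over j of (p_j - y_j) * q_j.
    """
    s = list(s)
    for _ in range(steps):
        grad = [0.0] * len(s)
        for q, y in zip(queries, targets):
            p = predict(s, q)
            for k in range(len(s)):
                grad[k] += (p - y) * q[k]
        # Plain gradient-descent step on the cell representation.
        s = [sk - lr * gk for sk, gk in zip(s, grad)]
    return s


def coarse_target(grid, q):
    """Hypothetical helper: index of the cell most likely to match query q."""
    return max(range(len(grid)), key=lambda i: predict(grid[i], q))


if __name__ == "__main__":
    # Two toy semantic queries, e.g. "is a mug" and "is openable" (made up).
    queries = [[1.0, 0.0], [0.0, 1.0]]
    # Observed local semantics for two cells (made-up labels).
    cell_a = update_cell([0.0, 0.0], queries, [1.0, 0.0])
    cell_b = update_cell([0.0, 0.0], queries, [0.0, 1.0])
    print([round(predict(cell_a, q), 3) for q in queries])
    print(coarse_target([cell_b, cell_a], queries[0]))
```

In this toy setting, the cell with the highest predicted probability for a query is the kind of global cue that coarse control could navigate toward. Fine control, which the summary describes as a neural policy over local visual frames, is outside the scope of the sketch.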

DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-level Control
