DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-level Control

Xinyu Xu, Shengcheng Luo, Yanchao Yang, Yong-Lu Li, Cewu Lu
2024-07-20
Abstract: Building a general-purpose intelligent home-assistant agent skilled in diverse tasks by human commands is a long-term blueprint of embodied AI research, which poses requirements on task planning, environment modeling, and object interaction. In this work, we study primitive mobile manipulations for embodied agents, i.e., how to navigate and interact based on an instructed verb-noun pair. We propose DISCO, which features non-trivial advancements in contextualized scene modeling and efficient controls. In particular, DISCO incorporates differentiable scene representations of rich semantics in object and affordance, which are dynamically learned on the fly and facilitate navigation planning. Besides, we propose dual-level coarse-to-fine action controls leveraging both global and local cues to accomplish mobile manipulation tasks efficiently. DISCO easily integrates into embodied tasks such as embodied instruction following. To validate our approach, we take the ALFRED benchmark of large-scale long-horizon vision-language navigation and interaction tasks as a test bed. In extensive experiments, we make comprehensive evaluations and demonstrate that DISCO outperforms the state of the art by a sizable +8.6% success rate margin in unseen scenes, even without step-by-step instructions. Our code is publicly released at https://github.com/AllenXuuu/DISCO.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper asks how to build an embodied AI agent, in the role of an intelligent home assistant, that can carry out long-horizon household tasks in complex, unstructured real-world environments by navigating and interacting according to human instructions. Specifically, it focuses on primitive mobile manipulation: how to navigate and interact given an instructed verb-noun pair.

### Core Problems of the Paper

1. **Task Planning**: How to parse human instructions and formulate corresponding plans.
2. **Environment Modeling**: How to perceive the surrounding environment, locate semantic entities, and navigate to target locations.
3. **Object Interaction**: How to interact with objects effectively, especially in the absence of step-by-step instructions.

### DISCO's Solutions

To address these problems, the paper proposes DISCO (DIfferentiable Scene Semantics and Dual-level COntrol). Its main contributions are:

1. **Differentiable Scene Semantic Representation**:
   - DISCO learns dynamic scene representations that carry rich object and affordance semantics and are updated on the fly.
   - The scene representations are optimized by gradient descent to match local point-cloud semantics.
   - Predictions follow the formula \( p_{i,j} = f(s_i, q_j) = \sigma(s_i^T q_j) \), where \( s_i \) is the scene representation of the \( i \)-th grid, \( q_j \) is the \( j \)-th semantic query vector, and \( \sigma \) is the sigmoid function.
2. **Dual-level Coarse-to-Fine Control**:
   - **Coarse Control**: Navigates using global cues (such as the scene map) to bring the agent close to the target object.
   - **Fine Control**: Makes fine-grained adjustments based on local visual frames (such as RGB images, depth estimates, and object masks) to achieve efficient object interaction.
   - In short, coarse control relies on the global semantic map to approach the target, while fine control uses neural policies to adjust the agent's pose and operate on objects according to local visual information.
3. **Application on the ALFRED Benchmark**:
   - Extensive experiments on the ALFRED benchmark show that DISCO outperforms existing methods in unseen scenes, even without step-by-step instructions.
   - In unseen scenes, DISCO exceeds the best existing method by a margin of 8.6% in success rate.

### Summary

By combining a differentiable scene semantic representation with a dual-level control mechanism, DISCO addresses the challenges of navigating and interacting according to human instructions in complex environments, and it performs well even in the absence of step-by-step guidance. This offers new ideas and technical means for building general-purpose home-assistant agents.
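
The scene-representation formula \( p_{i,j} = \sigma(s_i^T q_j) \) and its gradient-descent update can be illustrated with a small NumPy sketch. This is not the authors' implementation: the binary cross-entropy loss, the fixed random query vectors, the grid size, the learning rate, and the names `predict` and `update_scene` are all assumptions chosen for illustration. The sketch only shows how per-grid vectors \( s_i \) might be fitted to semantic labels observed for a few grid cells.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def predict(scene, queries):
    """Return p[i, j] = sigmoid(s_i^T q_j) for every grid cell i and query j."""
    return sigmoid(scene @ queries.T)


def update_scene(scene, queries, targets, observed, lr=0.1, steps=500):
    """Fit the vectors of the observed cells by gradient descent.

    Assumption: the loss is binary cross-entropy averaged over queries;
    the summary above only says that gradient descent is used.
    """
    scene = scene.copy()
    for _ in range(steps):
        p = predict(scene[observed], queries)
        # d(BCE)/d(logit) = p - y; the chain rule through s_i^T q_j gives (p - y) @ Q.
        grad = (p - targets) @ queries / queries.shape[0]
        scene[observed] -= lr * grad
    return scene


rng = np.random.default_rng(0)
num_cells, num_queries, dim = 16, 4, 8            # hypothetical sizes
queries = rng.normal(size=(num_queries, dim))     # hypothetical fixed query vectors q_j
scene = np.zeros((num_cells, dim))                # one vector s_i per grid cell

# Hypothetical local observation: three cells with binary labels per query.
observed = np.array([2, 5, 7])
targets = np.array([[1, 0, 0, 0],
                    [0, 1, 0, 0],
                    [0, 0, 1, 1]], dtype=float)

before_err = np.abs(predict(scene, queries)[observed] - targets).mean()
scene = update_scene(scene, queries, targets, observed)
after_err = np.abs(predict(scene, queries)[observed] - targets).mean()
print("mean absolute error before:", before_err)
print("mean absolute error after: ", after_err)
```

Only the rows of `scene` indexed by `observed` are updated, which mirrors the idea of refining the map from local observations while leaving unobserved cells untouched. How DISCO actually samples its supervision and which loss it uses should be checked against the paper and the released code.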