Abstract: Relying on multi-modal observations, embodied robots can perform multiple robotic manipulation tasks in unstructured real-world environments. However, most language-conditioned behavior-cloning agents still face long-standing challenges, i.e., 3D scene representation and human-level task learning, when adapting to new sequential tasks in practical scenarios. We investigate these challenges with NBAgent, a pioneering language-conditioned Never-ending Behavior-cloning Agent for embodied robots. It can continually learn observation knowledge of novel 3D scene semantics and robot manipulation skills from skill-shared and skill-specific attributes, respectively. Specifically, we propose a skill-shared semantic rendering module and a skill-shared representation distillation module to effectively learn 3D scene semantics from the skill-shared attribute, further tackling the overlooking of 3D scene representation. Meanwhile, we establish a skill-specific evolving planner to perform manipulation knowledge decoupling, which can continually embed novel skill-specific knowledge, as humans do, from latent and low-rank space. Finally, we design a never-ending embodied robot manipulation benchmark, and extensive experiments demonstrate the significant performance of our method. Visual results, code, and dataset are provided at: https://neragent.github.io

What problem does this paper attempt to address?

This paper addresses the problem of enabling embodied robots to perform varied manipulation tasks in unstructured real-world environments from multimodal observations. In particular, it targets two long-standing challenges that arise when an agent must adapt to new sequential tasks: 3D scene representation and human-level task learning. The paper proposes a language-conditioned Never-Ending Behavior-Cloning Agent (NBAgent) that continually learns new 3D scene semantics and robotic manipulation skills from skill-shared and skill-specific attributes, thereby mitigating the resource limitations and forgetting that existing methods suffer from when dealing with new skills and complex objects.

### Core Issues of the Paper

1. **3D Scene Representation**: Existing language-conditioned behavior-cloning agents often fail to represent complex 3D scenes effectively when handling new tasks, which limits their ability to understand and execute those tasks.
2. **Human-Level Task Learning**: Like humans, robots need to keep learning new manipulation skills in ever-changing environments without forgetting the skills they have already learned.
3. **Resource Limitations**: Existing methods consume substantial resources when dealing with new skills and complex objects, which makes them difficult to deploy in practical applications.

### Solution

The paper proposes an agent named NBAgent that addresses the above issues through the following modules:

1. **Skill-Shared Semantic Rendering Module (SSR)**: Using a NeRF model and a visual foundation model, it transfers skill-shared semantics from 2D space to 3D space, thereby overcoming the shortcomings in 3D scene understanding.
2. **Skill-Shared Representation Distillation Module (SRD)**: Through knowledge distillation, it aligns the 3D voxel representations of the old model and the current model, reducing catastrophic forgetting.
3. **Skill-Specific Evolving Planner (SEP)**: By decoupling skill-shared and skill-specific knowledge in latent space and low-rank space, it learns new skills while reducing the forgetting of old skills.

### Main Contributions

1. **First Exploration of a Practically Challenging Problem**: The paper proposes the Never-Ending Behavior-Cloning Robot Learning (NBRL) problem and designs NBAgent to solve the skill learning problem in 3D scenes.
2. **Skill-Shared Semantic Rendering Module and Skill-Shared Representation Distillation Module**: These modules capture 3D semantic knowledge, overcoming the neglect of 3D reasoning in continual learning.
3. **Skill-Specific Evolving Planner**: By learning skill-specific knowledge in latent space and low-rank space, it learns new skills effectively and reduces the forgetting of old skills.

### Experimental Validation

The paper conducts experiments on two benchmark datasets (Kitchen and Living Room) to verify the effectiveness and robustness of NBAgent. The results show that NBAgent can learn continually while reducing the forgetting of old skills when dealing with new skills and complex objects.

### Conclusion

By proposing NBAgent, this paper addresses the challenges of 3D scene representation and human-level task learning that existing behavior-cloning agents face when handling new tasks, providing a new solution for continual learning of robots in unstructured environments.
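The two forgetting-related ideas in the summary above (aligning voxel representations between the old and current model, and storing skill-specific knowledge in a low-rank space) can be illustrated with a small NumPy sketch. This is not the paper's implementation. It assumes a plain mean-squared-error alignment loss and a LoRA-style additive low-rank update per skill, and every name in it (`voxel_distillation_loss`, `LowRankSkillLayer`, the skill labels) is hypothetical.

```python
import numpy as np


def voxel_distillation_loss(old_feats, new_feats):
    """Mean squared error between old-model and current-model voxel features.

    Assumption: MSE stands in for whatever alignment loss the paper uses.
    """
    old_feats = np.asarray(old_feats, dtype=np.float64)
    new_feats = np.asarray(new_feats, dtype=np.float64)
    if old_feats.shape != new_feats.shape:
        raise ValueError("voxel feature grids must have the same shape")
    return float(np.mean((new_feats - old_feats) ** 2))


class LowRankSkillLayer:
    """Shared weight plus one low-rank update B @ A per skill (LoRA-style).

    Assumption: 'low-rank space' is modelled as an additive rank-r update.
    """

    def __init__(self, dim_in, dim_out, rank, seed=0):
        self.rng = np.random.default_rng(seed)
        self.dim_in, self.dim_out, self.rank = dim_in, dim_out, rank
        # Skill-shared weight, used by every skill.
        self.shared = self.rng.normal(scale=0.1, size=(dim_out, dim_in))
        # Skill name -> (A, B) low-rank factors.
        self.adapters = {}

    def add_skill(self, name):
        # B starts at zero, so a new skill initially matches the shared path.
        a = self.rng.normal(scale=0.1, size=(self.rank, self.dim_in))
        b = np.zeros((self.dim_out, self.rank))
        self.adapters[name] = (a, b)

    def forward(self, x, skill):
        a, b = self.adapters[skill]
        return self.shared @ x + b @ (a @ x)
```

In this sketch, training a new skill would update only that skill's `(A, B)` pair, so the outputs for earlier skills stay unchanged as long as the shared weight is held fixed or constrained by the distillation term.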

Never-Ending Behavior-Cloning Agent for Robotic Manipulation

Learning Robot Manipulation Skills from Human Demonstration Videos Using Two-Stream 2-D/3-D Residual Networks with Self-Attention

Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations

Learning Neuro-symbolic Programs for Language Guided Robot Manipulation

RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking

Learning the Generalizable Manipulation Skills on Soft-body Tasks via Guided Self-attention Behavior Cloning Policy

Explainable Behavior Cloning: Teaching Large Language Model Agents through Learning by Demonstration

One ACT Play: Single Demonstration Behavior Cloning with Action Chunking Transformers

RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation

Manipulate-Anything: Automating Real-World Robots using Vision-Language Models

Programmatically Grounded, Compositionally Generalizable Robotic Manipulation

Constrained Behavior Cloning for Robotic Learning

Robust and High-Precision End-to-End Control Policy for Multi-stage Manipulation Task with Behavioral Cloning.

ARCADE: Scalable Demonstration Collection and Generation via Augmented Reality for Imitation Learning

Online Continual Learning For Interactive Instruction Following Agents

Continual Skill and Task Learning via Dialogue

DeMoBot: Deformable Mobile Manipulation with Vision-based Sub-goal Retrieval

GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields

Behavior Cloning and Replay of Humanoid Robot Via a Depth Camera

Enhancing Interpretability and Interactivity in Robot Manipulation: A Neurosymbolic Approach

On the Effectiveness of Retrieval, Alignment, and Replay in Manipulation