Abstract: Embodied agents capable of complex physical skills can improve productivity, elevate life quality, and reshape human-machine collaboration. We aim to train embodied agents autonomously on a variety of tasks, relying mainly on large foundation models. These models are believed to be able to act as a brain for embodied agents; however, existing methods rely heavily on humans for task proposal and scene customization, which limits learning autonomy, training efficiency, and the generalization of the learned policies. In contrast, we introduce a brain-body synchronization (BBSEA) scheme to promote embodied learning in unknown environments without human involvement. The proposed scheme combines the wisdom of foundation models ("brain") with the physical capabilities of embodied agents ("body"). Specifically, it leverages the "brain" to propose learnable physical tasks and success metrics, enabling the "body" to automatically acquire various skills by continuously interacting with the scene. We explore the proposed autonomous learning scheme in a table-top setting and demonstrate that the proposed synchronization can generate diverse tasks and develop multi-task policies with promising adaptability to new tasks and configurations. We will release our data, code, and trained models to facilitate future studies on building autonomously learning agents with large foundation models in more complex scenarios. More visualizations are available at https://bbsea-embodied-ai.github.io

What problem does this paper attempt to address?

The paper addresses how to autonomously train embodied agents with complex physical skills in unknown environments, using large foundation models (LFMs) to reduce reliance on human intervention, improve learning efficiency, and enhance the generalization of the learned policies. Specifically, it proposes a Brain-Body Synchronization (BBSEA) scheme, which aims to achieve autonomous learning without human involvement by combining the intelligence of foundation models ("brain") with the physical capabilities of embodied agents ("body").

### Main Issues

1. **Reducing Human Intervention**: Existing methods rely heavily on humans for task proposal and scene customization, which limits learning autonomy, training efficiency, and the generalization of the learned policies.
2. **Improving Learning Efficiency and Generalization Ability**: How to efficiently train embodied agents in unknown environments so that they adapt to new tasks and configurations.
3. **Achieving Multi-Task Policies**: How to generate diverse tasks through autonomous learning and develop multi-task policies that adapt well.

### Solution

The paper proposes a framework that achieves brain-body synchronization through three key steps:

1. **Task Proposal**: The foundation model ("brain") proposes interactive tasks based on the scene and the physical constraints of the embodied agent.
2. **Task Completion Inference**: The foundation model defines success metrics for each task, helping the embodied agent determine whether the task has been executed successfully.
3. **Task-Conditioned Policy Learning**: The embodied agent acquires skills through continuous interaction with the environment (trial and error) and learns task-conditioned policies based on the feedback.
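The three steps above can be sketched as a small data-collection loop. This is a minimal illustration under stated assumptions, not the paper's implementation: `propose_tasks` is a hard-coded stand-in for the foundation model ("brain"), `attempt` is a random stand-in for the embodied agent ("body"), and the scene representation, every function name, and the choice to keep only successful trials as training data are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical scene format: object name -> (x, y) position on the table.
Scene = Dict[str, Tuple[float, float]]


@dataclass
class Task:
    description: str
    # Success metric: (scene_before, scene_after) -> bool.
    is_success: Callable[[Scene, Scene], bool]


def propose_tasks(scene: Scene) -> List[Task]:
    """Stand-in for the "brain": one 'move left' task per object."""
    tasks = []
    for name in sorted(scene):
        # Bind `name` as a default argument so each closure keeps its own object.
        def moved_left(before: Scene, after: Scene, name: str = name) -> bool:
            return after[name][0] < before[name][0]

        tasks.append(Task(f"move the {name} to the left", moved_left))
    return tasks


def attempt(task: Task, scene: Scene, rng: random.Random) -> Scene:
    """Stand-in for the "body": nudge a random object along x (pure trial and error)."""
    after = dict(scene)
    name = rng.choice(sorted(scene))
    x, y = scene[name]
    after[name] = (x + rng.uniform(-0.1, 0.1), y)
    return after


def collect_experience(
    scene: Scene, trials_per_task: int = 20, seed: int = 0
) -> List[Tuple[str, Scene, Scene]]:
    """Run every proposed task several times and keep the trials judged successful."""
    rng = random.Random(seed)
    dataset = []
    for task in propose_tasks(scene):
        for _ in range(trials_per_task):
            after = attempt(task, scene, rng)
            if task.is_success(scene, after):
                # (task description, scene before, scene after) could later
                # serve as training data for a task-conditioned policy.
                dataset.append((task.description, scene, after))
    return dataset
```

Calling `collect_experience({"red block": (0.5, 0.2), "blue block": (0.3, 0.4)})` returns a list of successful trials labeled with their task descriptions. In a real system the stand-ins would be replaced by a foundation-model query and a robot controller, and the collected data would be used to train the multi-task policy.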
### Experimental Validation

The paper validates the proposed framework through experiments in a tabletop manipulation environment, evaluating task diversity, the feasibility of task proposals, the accuracy of task completion inference, and the effectiveness and generalization of the multi-task policies. The results show that the BBSEA framework can generate diverse, human-understandable tasks and is reliable and accurate in both task proposal and success inference.

### Contributions

1. **Autonomous Learning Framework**: Proposes a framework that combines foundation models with embodied agents to achieve autonomous learning in unknown environments.
2. **Task Proposal Module**: Develops an efficient scene understanding module that can automatically propose tasks compatible with the scene and establish evaluation criteria for task completion.
3. **Policy Learning Validation**: Validates the adaptability of the learned policies to new tasks and configurations in zero-shot and few-shot settings.

BBSEA: An Exploration of Brain-Body Synchronization for Embodied Agents
