Weixin Mao, Weiheng Zhong, Zhou Jiang, Dong Fang, Zhongyue Zhang, Zihan Lan, Fan Jia, Tiancai Wang, Haoqiang Fan, Osamu Yoshie
Abstract: Existing policy learning methods predominantly adopt the task-centric paradigm, necessitating the collection of task data in an end-to-end manner. Consequently, the learned policy tends to fail to tackle novel tasks. Moreover, it is hard to localize the errors for a complex task with multiple stages due to end-to-end learning. To address these challenges, we propose RoboMatrix, a skill-centric and hierarchical framework for scalable task planning and execution. We first introduce a novel skill-centric paradigm that extracts the common meta-skills from different complex tasks. This allows for the capture of embodied demonstrations through a skill-centric approach, enabling the completion of open-world tasks by combining learned meta-skills. To fully leverage meta-skills, we further develop a hierarchical framework that decouples complex robot tasks into three interconnected layers: (1) a high-level modular scheduling layer; (2) a middle-level skill layer; and (3) a low-level hardware layer. Experimental results illustrate that our skill-centric and hierarchical framework achieves remarkable generalization performance across novel objects, scenes, tasks, and embodiments. This framework offers a novel solution for robot task planning and execution in open-world scenarios. Our software and hardware are available at https://github.com/WayneMao/RoboMatrix.
### Problems the paper attempts to solve
This paper aims to solve several key problems in existing robot task planning and execution methods:
1. **Low data collection efficiency**: Existing task-centric methods need end-to-end data collected for each new task, which is time-consuming and resource-intensive for complex tasks.
2. **Poor generalization to new tasks**: Because they are trained end to end on specific tasks, these methods perform poorly on unseen tasks and cannot generate new action sequences.
3. **Difficult error localization**: Because end-to-end learning is a black box, it is hard to determine the stage at which an error occurs, especially in complex multi-stage tasks.
To overcome these problems, the paper proposes **RoboMatrix**, a skill-centric hierarchical framework for scalable robot task planning and execution in the open world. The framework extracts the meta-skills shared by different complex tasks and combines them to complete new tasks, which improves data collection efficiency, task generalization, and error localization.
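The idea of extracting shared meta-skills and recombining them can be sketched as follows. This is a minimal illustration, not the paper's implementation: the task names, the meta-skill names, and the helpers `build_skill_matrix` and `can_compose` are all hypothetical.

```python
# Hypothetical tasks, each written as a sequence of hypothetical meta-skills.
TASKS = {
    "clear the table": ["move_to", "grasp", "move_to", "release"],
    "fetch the cup": ["search", "move_to", "grasp", "move_to", "release"],
    "close the drawer": ["move_to", "push"],
}


def build_skill_matrix(tasks):
    """Return (skills, matrix); matrix[task][skill] is True if the task uses the skill."""
    # The set of meta-skills is the union of all steps across all tasks.
    skills = sorted({step for plan in tasks.values() for step in plan})
    matrix = {
        name: {skill: skill in plan for skill in skills}
        for name, plan in tasks.items()
    }
    return skills, matrix


def can_compose(plan, skills):
    """A novel task is covered if every step is an already-known meta-skill."""
    return all(step in skills for step in plan)


skills, matrix = build_skill_matrix(TASKS)

# A task that never appeared in TASKS but only reuses known meta-skills.
novel_plan = ["search", "move_to", "push"]
novel_is_covered = can_compose(novel_plan, skills)
```

Under these assumptions, three tasks reduce to five reusable meta-skills, and a novel task needs new demonstrations only when it requires a step outside that set.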
### Main contributions
1. **Introduced a skill-centric hierarchical framework**: The framework enables scalable robot task planning and execution in open-world scenarios.
2. **Proposed a unified vision-language-action (VLA) model**: The model handles both robot movement and manipulation within a single model.
3. **Demonstrated strong generalization to new objects, scenes, tasks, and embodiments**.
### Method overview
1. **Skill-centric pipeline**:
- **Meta-skill extraction**: Extract the meta-skills shared by different complex tasks and build a skill matrix.
- **Skill database**: Continuously refine and expand the skill database by collecting and organizing skill data.
2. **Skill model**:
- **Vision-language-action (VLA) model**: Built on a pre-trained language model (such as Vicuna 1.5), combined with a visual encoder and an action generation module, to execute tasks end to end.
- **Hybrid model**: Handles tasks in unstructured environments, such as object grasping and searching, by combining traditional control methods (such as PD control) with modern detection algorithms (such as YOLO-World).
3. **RoboMatrix framework**:
- **Modular scheduling layer**: Decomposes complex tasks into sub-task sequences and schedules their execution according to feedback from the skill model.
- **Skill layer**: Maps sub-task descriptions to specific robot actions and emits a stop signal indicating whether the current sub-task is complete.
- **Hardware layer**: Manages the robot's controller and state observer, converts actions into control signals, and updates the robot's state and image in real time.
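The interaction between the three layers can be sketched as a small control loop. This is a toy illustration under stated assumptions, not the authors' code: all class and function names are hypothetical, the robot is reduced to a 1-D position, the scheduler is a lookup table standing in for the task decomposition, and each skill is a simple proportional controller standing in for the VLA or hybrid skill model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Observation:
    position: float  # 1-D stand-in for the robot state and camera image


class HardwareLayer:
    """Converts actions into (simulated) control signals and reports the state."""

    def __init__(self, start: float = 0.0) -> None:
        self.position = start

    def observe(self) -> Observation:
        return Observation(self.position)

    def apply(self, action: float) -> Observation:
        self.position += action
        return self.observe()


# A skill maps an observation to (action, stop_signal).
Skill = Callable[[Observation], Tuple[float, bool]]


def make_move_to(target: float, kp: float = 0.5, tol: float = 0.01) -> Skill:
    """Hypothetical skill: proportional control toward a target position."""

    def skill(obs: Observation) -> Tuple[float, bool]:
        error = target - obs.position
        if abs(error) < tol:
            return 0.0, True  # stop signal: sub-task complete
        return kp * error, False

    return skill


class Scheduler:
    """Lookup-table stand-in for the modular scheduling layer."""

    def __init__(self, plans: Dict[str, List[str]]) -> None:
        self.plans = plans

    def decompose(self, task: str) -> List[str]:
        return self.plans[task]


def run_task(task: str, scheduler: Scheduler, skills: Dict[str, Skill],
             hardware: HardwareLayer, max_steps: int = 100) -> List[str]:
    """Run sub-tasks in order; return the list of sub-tasks that completed."""
    completed: List[str] = []
    for sub_task in scheduler.decompose(task):
        skill = skills[sub_task]
        obs = hardware.observe()
        for _ in range(max_steps):
            action, stop = skill(obs)
            if stop:
                completed.append(sub_task)
                break
            obs = hardware.apply(action)
        else:
            break  # sub-task never signalled stop; abort the remaining plan
    return completed


scheduler = Scheduler({"visit A then B": ["move to A", "move to B"]})
skills = {"move to A": make_move_to(1.0), "move to B": make_move_to(2.0)}
hardware = HardwareLayer()
completed = run_task("visit A then B", scheduler, skills, hardware)
```

Because `run_task` returns the sub-tasks that finished, a failure can be traced to the first sub-task missing from that list, which mirrors the error-localization benefit the summary attributes to the hierarchical design.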
### Experimental results
The paper evaluates the RoboMatrix framework through a series of experiments:
1. **Meta-skill performance evaluation**: Eight meta-skills were evaluated, and the results showed that the model performed well on both seen and unseen objects and scenes.
2. **Task-level generalization performance**: A five-level generalization evaluation protocol tested the model on tasks and scenes of increasing difficulty. The results showed that the skill-centric method significantly outperformed the task-centric method on complex tasks.
3. **Cross-embodiment generalization**: The model was deployed directly on different types of robots to test how well it adapts to new embodiments.
### Conclusion
By introducing a skill-centric hierarchical framework, RoboMatrix addresses the shortcomings of existing methods in data collection efficiency, task generalization, and error localization, providing a new solution for robot task planning and execution in the open world.