FurnitureBench: Reproducible Real-World Benchmark for Long-Horizon Complex Manipulation

Minho Heo, Youngwoon Lee, Doohyun Lee, Joseph J. Lim
2023-05-22
Abstract: Reinforcement learning (RL), imitation learning (IL), and task and motion planning (TAMP) have demonstrated impressive performance across various robotic manipulation tasks. However, these approaches have been limited to learning simple behaviors in current real-world manipulation benchmarks, such as pushing or pick-and-place. To enable more complex, long-horizon behaviors of an autonomous robot, we propose to focus on real-world furniture assembly, a complex, long-horizon robot manipulation task that requires addressing many current robotic manipulation challenges to solve. We present FurnitureBench, a reproducible real-world furniture assembly benchmark aimed at providing a low barrier for entry and being easily reproducible, so that researchers across the world can reliably test their algorithms and compare them against prior work. For ease of use, we provide 200+ hours of pre-collected data (5000+ demonstrations), 3D printable furniture models, a robotic environment setup guide, and systematic task initialization. Furthermore, we provide FurnitureSim, a fast and realistic simulator of FurnitureBench. We benchmark the performance of offline RL and IL algorithms on our assembly tasks and demonstrate the need to improve such algorithms to be able to solve our tasks in the real world, providing ample opportunities for future research.
Robotics, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

Current robotic manipulation benchmarks cover tasks that are too simple and too short-horizon. To address this, the paper proposes a new benchmark, FurnitureBench, which focuses on real-world furniture assembly. Furniture assembly is a complex, long-horizon manipulation task that requires solving many open challenges in robotic manipulation, such as long-horizon planning, dexterous control, and visual perception. With this benchmark, researchers can test and compare their algorithms more reliably, which helps advance robotics toward solving everyday tasks.

### Main Contributions

1. **Introduction of FurnitureBench**: A real-world furniture assembly benchmark that lets robotics researchers study Reinforcement Learning (RL), Imitation Learning (IL), and Task and Motion Planning (TAMP) algorithms on realistic and complex tasks.
2. **Ensuring Reproducibility**: 3D-printable furniture parts and detailed environment setup guidelines allow any lab to establish a standardized robotic environment.
3. **Collection of Extensive Demonstration Data**: Over 200 hours of teleoperation demonstrations were collected, lowering the practical barrier to experiments with offline RL and IL algorithms.
4. **Evaluation of Single-Skill Subtasks**: The skills required for furniture assembly, such as grasping, inserting, and screwing, were evaluated in single-skill benchmarks to identify the challenges in learning each skill.
5. **Full Assembly Task Evaluation**: The complete furniture assembly task was evaluated in full assembly benchmarks, where current IL and offline RL methods complete only 2 out of 12 subtasks on average.
6. **Development of FurnitureSim**: A simulator that accelerates experimental iteration on new methods.

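Contribution 5 reports performance as the average number of subtasks completed out of 12. A minimal sketch of how such a metric could be computed is shown below. The function names and the episode counts are hypothetical illustrations; they are not the authors' evaluation code or results from the paper.

```python
NUM_SUBTASKS = 12  # subtask count quoted in the summary for the full assembly benchmark


def mean_completed_subtasks(completed_per_episode, num_subtasks=NUM_SUBTASKS):
    """Average number of subtasks completed per evaluation episode."""
    if not completed_per_episode:
        raise ValueError("need at least one episode")
    for count in completed_per_episode:
        if not 0 <= count <= num_subtasks:
            raise ValueError(f"completed count {count} outside [0, {num_subtasks}]")
    return sum(completed_per_episode) / len(completed_per_episode)


def full_success_rate(completed_per_episode, num_subtasks=NUM_SUBTASKS):
    """Fraction of episodes in which every subtask was completed."""
    if not completed_per_episode:
        raise ValueError("need at least one episode")
    successes = sum(count == num_subtasks for count in completed_per_episode)
    return successes / len(completed_per_episode)


# Made-up episode counts for illustration only; not results from the paper.
episodes = [1, 2, 3, 2, 2]
print(mean_completed_subtasks(episodes))
print(full_success_rate(episodes))
```

Reporting the mean number of completed subtasks alongside the full-success rate is useful for long-horizon tasks, because a policy can make measurable partial progress while never finishing an entire assembly.
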
### Background and Motivation

- **Limitations of Existing Benchmarks**: Existing robotic manipulation benchmarks mostly focus on simple, short-horizon tasks, such as pushing objects or pick-and-place. These tasks cannot comprehensively evaluate complex, long-horizon manipulation capabilities.
- **Real-World Needs**: Real-world physical tasks, such as doing laundry, tidying a room, cooking, and assembling furniture, require robots to understand the environment, plan, and execute complex long-horizon behaviors. These activities span long periods, involve combinations of semantically diverse behaviors, and require dexterous and precise manipulation skills.
- **Necessity of Complex Tasks**: Advancing robotics toward everyday tasks requires tackling more complex, longer-horizon tasks, which in turn requires reproducible benchmarks so that research results are reliable and comparable.

### Methods and Experiments

- **System Design**: FurnitureBench includes a 7-degree-of-freedom Franka Emika Panda robotic arm, three Intel RealSense D435 RGB-D cameras, and 3D-printed furniture models. Detailed environment setup instructions and software tools allow users to construct new environments based on the pose estimation of furniture and tables.
- **Reproducibility Analysis**: The reproducibility of the benchmark was verified by having 10 participants set up the environment from scratch. Performance in the new environments was 75% to 93% of that in the original environment.
- **Observation and Robot Control**: The observation space includes front-view and wrist camera inputs as well as the robot's proprioceptive state. The end-effector is controlled through incremental (delta) pose commands executed with operational space control.
- **Task Initialization Tool**: A graphical user interface tool guides users in matching the initial poses of furniture parts to target configurations sampled from predefined distributions, so that experimental results are comparable across users.
- **Large-Scale Demonstration Dataset**: 219.6 hours of successful demonstrations were collected using Oculus Quest 2 controllers and a keyboard, for training IL and offline RL methods.
- **Simulated Environment**: The FurnitureSim simulator allows quick verification of the correctness and performance of algorithms.

### Results

- **Single-Skill Benchmark**: Individual skill policies successfully learned the "grasping" and "placing" skills, but success rates for the "inserting" and "screwing" skills were low, ranging from 0% to 20%.
- **Full Assembly Task**: IL and offline RL methods completed only 2 out of 12 subtasks on average in the full assembly benchmark.
- **Simulated Environment and Real Environment**: