Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots

Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, Shuran Song
2024-03-06
Abstract: We present Universal Manipulation Interface (UMI) -- a data collection and policy learning framework that allows direct skill transfer from in-the-wild human demonstrations to deployable robot policies. UMI employs hand-held grippers coupled with careful interface design to enable portable, low-cost, and information-rich data collection for challenging bimanual and dynamic manipulation demonstrations. To facilitate deployable policy learning, UMI incorporates a carefully designed policy interface with inference-time latency matching and a relative-trajectory action representation. The resulting learned policies are hardware-agnostic and deployable across multiple robot platforms. Equipped with these features, UMI framework unlocks new robot manipulation capabilities, allowing zero-shot generalizable dynamic, bimanual, precise, and long-horizon behaviors, by only changing the training data for each task. We demonstrate UMI's versatility and efficacy with comprehensive real-world experiments, where policies learned via UMI zero-shot generalize to novel environments and objects when trained on diverse human demonstrations. UMI's hardware and software system is open-sourced at
Robotics
What problem does this paper attempt to address?
The paper addresses how to train robots to perform complex manipulation skills directly from human demonstrations collected in the wild. Specifically, it asks how to design a portable, low-cost, and information-rich framework for collecting human demonstration data so that the data transfers effectively to robot policies capable of dynamic, bimanual, precise, and long-horizon tasks.

### Main problem points:

1. **Limitations of existing methods**:
   - **Teleoperation**: Demonstrations transfer directly to the robot, but collection requires expensive hardware and expert operators, which limits its use in in-the-wild settings.
   - **Human videos**: These provide rich visual diversity, but the embodiment gap between humans and robots makes it difficult to transfer actions directly.
2. **Specific challenges**:
   - **Insufficient visual context**: A wrist-mounted camera limits visual coverage of the scene, so the policy lacks sufficient visual information for action planning.
   - **Imprecise actions**: Most handheld devices rely on monocular structure-from-motion (SfM) to recover robot actions, and SfM struggles to produce precise actions under scale ambiguity, motion blur, or insufficient texture.
   - **Latency discrepancy**: During data collection, observations and actions are recorded without latency. During inference, latencies within the system (sensor, inference, and execution latency) push the input data out of distribution and desynchronize the resulting actions.
   - **Insufficient policy representation**: Previous works usually use simple policy representations (such as MLPs) and action regression losses, which limit their ability to capture the complex multimodal action distributions in human data.

### Solutions:

1. **Physical interface design**:
   - Use a wide-angle fisheye lens to increase the field of view and visual context.
   - Add side mirrors to the gripper to provide implicit stereo observation.
   - Use the GoPro's built-in IMU for robust tracking that maintains high precision even during rapid motion.
2. **Policy interface design**:
   - **Inference-time latency matching**: Compensate for the different latencies of the observation streams so that observations are time-synchronized on the actual robot system.
   - **Relative-trajectory action representation**: Represent actions as a trajectory relative to the current gripper end-effector (EE) pose to improve the robustness of the system.
   - **Relative end-effector pose**: Represent the EE pose history as a relative trajectory, which provides velocity information and makes the system calibration-free.
   - **Relative pose between grippers**: In a bimanual setup, provide the relative pose between the two grippers to achieve better bimanual coordination.

### Experimental verification:

The paper verifies the effectiveness of the UMI framework through extensive real-world experiments, demonstrating zero-shot generalization across a variety of tasks, including dynamic, bimanual, precise, and long-horizon tasks. The results show that UMI can convert human demonstration data into effective policies that can be deployed on different robot platforms.

### Summary:

Through carefully designed data collection and policy interfaces, the UMI framework addresses the challenges that existing methods face in training robots to perform complex manipulation skills in the wild, achieving effective transfer from human demonstrations to robot policies.
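
The relative-trajectory action representation listed under the policy interface design can be illustrated with homogeneous transforms. The following is a minimal sketch, not the paper's implementation: the helper names (`make_pose`, `to_relative`, `to_absolute`) and the use of 4x4 matrices are assumptions made for illustration. It expresses each future target pose in the frame of the current end-effector pose, and shows that the result does not change when every pose is shifted by the same global transform, which is one way to read the "calibration-free" property.

```python
import numpy as np

def make_pose(yaw, xyz):
    """Hypothetical helper: 4x4 homogeneous transform from a yaw angle (rad) and a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = xyz
    return T

def to_relative(T_current, T_targets):
    """Express world-frame target poses in the frame of the current EE pose."""
    # (4, 4) @ (N, 4, 4) broadcasts over the leading axis.
    return np.linalg.inv(T_current) @ np.asarray(T_targets)

def to_absolute(T_current, T_relative):
    """Map relative poses back to the world frame given the current EE pose."""
    return T_current @ np.asarray(T_relative)

# Current EE pose and two future target poses, all in some world frame.
current = make_pose(0.3, [0.5, 0.1, 0.2])
targets = [make_pose(0.4, [0.6, 0.1, 0.2]), make_pose(0.5, [0.7, 0.2, 0.2])]
relative = to_relative(current, targets)

# Re-expressing every pose in a different world frame G leaves the
# relative trajectory unchanged, so no global calibration is needed.
G = make_pose(1.2, [3.0, -2.0, 0.5])
relative_shifted = to_relative(G @ current, [G @ T for T in targets])
```

Under the same assumptions, the relative pose between two grippers in a bimanual setup could be computed as `to_relative(T_left, [T_right])`.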
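
Inference-time latency matching, as summarized above, involves two ideas: aligning observation streams to a common capture time, and discarding predicted actions that can no longer be executed on time. The sketch below is one possible reading under stated assumptions, not the paper's code: the function names, the latency values, and the use of linear interpolation are all hypothetical. Linear interpolation is only appropriate for positions and gripper widths; rotations would need spherical interpolation.

```python
import numpy as np

def align_proprio_to_camera(cam_recv_time, cam_latency, proprio_times, proprio_values):
    """Estimate proprioception at the instant the camera frame was captured.

    cam_latency is an assumed, separately measured delay between capture and receipt.
    """
    capture_time = cam_recv_time - cam_latency
    values = np.asarray(proprio_values, dtype=float)
    # Interpolate each dimension independently at the capture time.
    aligned = np.array([
        np.interp(capture_time, proprio_times, values[:, d])
        for d in range(values.shape[1])
    ])
    return capture_time, aligned

def drop_stale_actions(action_times, actions, now, execution_latency):
    """Keep only actions whose target time can still be met given execution latency."""
    action_times = np.asarray(action_times, dtype=float)
    keep = action_times >= now + execution_latency
    return action_times[keep], np.asarray(actions)[keep]

# Observation side: a frame received at t=0.25 s with 0.10 s camera latency
# was captured at roughly t=0.15 s, so proprioception is interpolated there.
capture_time, aligned = align_proprio_to_camera(
    cam_recv_time=0.25,
    cam_latency=0.10,
    proprio_times=[0.0, 0.1, 0.2],
    proprio_values=[[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]],
)

# Action side: by the time inference finishes (now=1.05 s), the first two
# predicted waypoints are too close in time to be executed and are dropped.
kept_times, kept_actions = drop_stale_actions(
    action_times=[1.0, 1.1, 1.2, 1.3],
    actions=np.arange(4),
    now=1.05,
    execution_latency=0.10,
)
```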