RVT-2: Learning Precise Manipulation from Few Demonstrations

Ankit Goyal, Valts Blukis, Jie Xu, Yijie Guo, Yu-Wei Chao, Dieter Fox
2024-06-13
Abstract: In this work, we study how to build a robotic system that can solve multiple 3D manipulation tasks given language instructions. To be useful in industrial and household domains, such a system should be capable of learning new tasks with few demonstrations and solving them precisely. Prior works, like PerAct and RVT, have studied this problem; however, they often struggle with tasks requiring high precision. We study how to make them more effective, precise, and fast. Using a combination of architectural and system-level improvements, we propose RVT-2, a multitask 3D manipulation model that is 6X faster in training and 2X faster in inference than its predecessor RVT. RVT-2 achieves a new state-of-the-art on RLBench, improving the success rate from 65% to 82%. RVT-2 is also effective in the real world, where it can learn tasks requiring high precision, like picking up and inserting plugs, with just 10 demonstrations. Visual results, code, and trained model are provided at: https://robotic-view-transformer-2.github.io/
Robotics, Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper studies how to build a robotic system that can complete multiple 3D manipulation tasks from natural-language instructions, with a focus on learning new tasks from only a few demonstrations and solving them precisely. Such a system needs three properties: it handles multiple tasks, it requires few demonstrations, and it solves tasks with high precision. Prior work such as PerAct and RVT has made progress on this problem but often struggles on tasks that require high precision.

The authors propose RVT-2, an improved version of RVT that combines architectural and system-level improvements. RVT-2 is a multitask 3D manipulation model that trains 6 times faster and runs inference 2 times faster than RVT, and it raises the success rate on the RLBench benchmark from 65% to 82%, a gain of 17 percentage points. With only about 10 demonstrations, RVT-2 can learn and execute high-precision tasks in the real world, such as picking up and inserting plugs.

The main architectural improvements are a multi-stage inference pipeline that predicts more accurate end-effector poses, convex upsampling that saves GPU memory and improves speed, and positional conditioning features that improve end-effector rotation prediction. The system-level optimizations are a custom virtual image renderer that accelerates rendering and reduces memory usage, and established practices for training transformer models, such as mixed-precision training.

The paper also compares RVT-2 with related work, including RVT, which operates on multi-view virtual images, to highlight its advantages on high-precision manipulation tasks, and it discusses how RVT-2 differs from other methods that learn from a small number of examples (e.g., MimicPlay). In conclusion, these combined improvements raise both the efficiency and the accuracy of 3D manipulation, a step toward a general-purpose robotic system that learns from few demonstrations.
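
To make the convex upsampling mentioned in the summary above more concrete, here is a minimal NumPy sketch of the generic technique. It is not RVT-2's implementation: the function name `convex_upsample`, the 3x3 neighborhood, the edge padding at the borders, and the tensor layout are all assumptions chosen for illustration. Each fine-resolution pixel is computed as a convex combination (non-negative weights that sum to 1) of the coarse values around it, so the network only has to produce a low-resolution map plus the weights.

```python
import numpy as np

def convex_upsample(coarse, logits, r):
    """Upsample a (H, W) map by an integer factor r (illustrative sketch).

    logits has shape (H, W, r*r, 9): for each coarse cell and each of its
    r*r fine sub-pixels, one score per cell of the 3x3 coarse neighborhood.
    """
    H, W = coarse.shape
    # Softmax over the 9 neighbors: weights are non-negative and sum to 1.
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    # Gather the 3x3 neighborhood of every coarse cell.
    # Edge padding at the borders is an assumption of this sketch.
    p = np.pad(coarse, 1, mode="edge")
    nbrs = np.stack(
        [p[dy:dy + H, dx:dx + W] for dy in range(3) for dx in range(3)],
        axis=-1,
    )  # shape (H, W, 9)
    # One convex combination per fine sub-pixel: shape (H, W, r*r).
    fine = np.einsum("hwkn,hwn->hwk", w, nbrs)
    # Place the r*r sub-pixels of each coarse cell into an (H*r, W*r) image.
    return fine.reshape(H, W, r, r).transpose(0, 2, 1, 3).reshape(H * r, W * r)

# Usage with random weights; in a real model the logits would be predicted.
rng = np.random.default_rng(0)
coarse = np.arange(6.0).reshape(2, 3)
logits = rng.normal(size=(2, 3, 16, 9))
fine = convex_upsample(coarse, logits, r=4)
print(fine.shape)  # (8, 12)
```

Because the weights are convex, every upsampled value stays within the range of the coarse values it is built from, and putting all the weight on the center neighbor reduces the operation to nearest-neighbor upsampling.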