Abstract: In this work, we study how to build a robotic system that can solve multiple 3D manipulation tasks given language instructions. To be useful in industrial and household domains, such a system should be capable of learning new tasks with few demonstrations and solving them precisely. Prior works, like PerAct and RVT, have studied this problem, however, they often struggle with tasks requiring high precision. We study how to make them more effective, precise, and fast. Using a combination of architectural and system-level improvements, we propose RVT-2, a multitask 3D manipulation model that is 6X faster in training and 2X faster in inference than its predecessor RVT. RVT-2 achieves a new state-of-the-art on RLBench, improving the success rate from 65% to 82%. RVT-2 is also effective in the real world, where it can learn tasks requiring high precision, like picking up and inserting plugs, with just 10 demonstrations. Visual results, code, and trained model are provided at: https://robotic-view-transformer-2.github.io/

What problem does this paper attempt to address?

This paper studies how to build a robotic system that carries out a variety of 3D manipulation tasks from natural language instructions. Such a system needs three properties: it must handle multiple tasks, learn new tasks from a small number of demonstrations, and solve them with high precision. Prior work such as PerAct and RVT has made progress on this problem but often struggles on tasks that require high precision.

The authors propose RVT-2, an improved version of RVT that combines architectural and system-level changes. RVT-2 is a multi-task 3D manipulation model that trains 6 times faster and runs inference 2 times faster than RVT, and it raises the success rate on the RLBench benchmark from 65% to 82%, a gain of 17 percentage points. With only about 10 demonstrations, RVT-2 can also learn high-precision tasks in the real world, such as picking up and inserting plugs.

The main architectural improvements are a multi-stage inference pipeline that predicts more accurate end-effector poses, convex upsampling that saves GPU memory and improves speed, and position-conditioned features that improve end-effector rotation prediction. The system-level optimizations are a custom virtual image renderer that accelerates rendering and reduces memory usage, and established practices for training transformer models, such as mixed-precision training.

The paper also compares RVT-2 with related work, including RVT, which operates on multi-view virtual images, and other methods that learn from a small number of examples (e.g., MimicPlay), and it highlights RVT-2's advantages on high-precision manipulation tasks. Overall, RVT-2 improves both the efficiency and the accuracy of 3D manipulation, a step toward general-purpose robot systems that learn from few demonstrations.
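To give some intuition for why a multi-stage inference pipeline can yield more accurate end-effector positions, the toy sketch below localizes a 2D target in two passes: a coarse grid over the whole workspace, then the same grid resolution over a smaller region centered on the coarse estimate. This is not RVT-2's implementation, which predicts learned heatmaps on rendered virtual views. The `score` function, the 1 m square workspace, the grid resolution, and the zoom factor are all assumptions made for illustration.

```python
import math


def score(point, target):
    """Hypothetical stand-in for a learned heatmap: higher when closer to the target."""
    return -math.dist(point, target)


def best_cell(center, half_width, resolution, target):
    """Score a resolution x resolution grid over a square region.

    Returns the center of the best-scoring cell and the cell size.
    """
    cell = 2 * half_width / resolution
    best, best_score = None, -math.inf
    for i in range(resolution):
        for j in range(resolution):
            p = (center[0] - half_width + (i + 0.5) * cell,
                 center[1] - half_width + (j + 0.5) * cell)
            s = score(p, target)
            if s > best_score:
                best, best_score = p, s
    return best, cell


def coarse_to_fine(target, resolution=10, zoom=4.0):
    """Two-stage localization with a fixed grid resolution per stage."""
    # Stage 1: the whole workspace, assumed to be a 1 m x 1 m square.
    coarse, coarse_cell = best_cell((0.5, 0.5), 0.5, resolution, target)
    # Stage 2: same resolution, but over a region `zoom` times smaller,
    # centered on the stage-1 estimate.
    fine, fine_cell = best_cell(coarse, 0.5 / zoom, resolution, target)
    return coarse, fine, coarse_cell, fine_cell


if __name__ == "__main__":
    target = (0.337, 0.712)
    coarse, fine, coarse_cell, fine_cell = coarse_to_fine(target)
    print("coarse cell size:", coarse_cell, "error:", math.dist(coarse, target))
    print("fine cell size:  ", fine_cell, "error:", math.dist(fine, target))
```

Both stages evaluate the same number of grid cells, so the second stage gains precision by narrowing the region it covers instead of raising the resolution. That is the general appeal of coarse-to-fine designs when memory and compute are limited.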

RVT-2: Learning Precise Manipulation from Few Demonstrations

Learning Robot Manipulation Skills from Human Demonstration Videos Using Two-Stream 2-D/3-D Residual Networks with Self-Attention

Precise and Dexterous Robotic Manipulation via Human-in-the-Loop Reinforcement Learning

NaturalVLM: Leveraging Fine-grained Natural Language for Affordance-Guided Visual Manipulation

Vision-Based Robotic Object Grasping—A Deep Reinforcement Learning Approach

R3M: A Universal Visual Representation for Robot Manipulation

Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction

Haptic-ACT: Bridging Human Intuition with Compliant Robotic Manipulation via Immersive VR

Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation

Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations

VIRT: Vision Instructed Transformer for Robotic Manipulation

Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Datasets

Learning Generalizable 3D Manipulation With 10 Demonstrations

VIHE: Virtual In-Hand Eye Transformer for 3D Robotic Manipulation

Learning Robotic Manipulation through Visual Planning and Acting

Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation

M2T2: Multi-Task Masked Transformer for Object-centric Pick and Place

Manipulate-Anything: Automating Real-World Robots using Vision-Language Models

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

Q-Attention: Enabling Efficient Learning for Vision-Based Robotic Manipulation

PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation