Autoregressive Action Sequence Learning for Robotic Manipulation

Xinyu Zhang, Yuhan Liu, Haonan Chang, Liam Schramm, Abdeslam Boularias
2024-10-12
Abstract: Autoregressive models have demonstrated remarkable success in natural language processing. In this work, we design a simple yet effective autoregressive architecture for robotic manipulation tasks. We propose the Chunking Causal Transformer (CCT), which extends the next-single-token prediction of causal transformers to support multi-token prediction in a single pass. Further, we design a novel attention interleaving strategy that allows CCT to be trained efficiently with teacher-forcing. Based on CCT, we propose the Autoregressive Policy (ARP) model, which learns to generate action sequences autoregressively. We find that action sequence learning enables better leverage of the underlying causal relationships in robotic tasks. We evaluate ARP across diverse robotic manipulation environments, including Push-T, ALOHA, and RLBench, and show that it outperforms the state-of-the-art methods in all tested environments, while being more efficient in computation and parameter sizes. Video demonstrations, our source code, and the models of ARP can be found at http://github.com/mlzxy/arp.
Robotics, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The problem this paper attempts to solve is: **How can an effective autoregressive architecture be designed to generate action sequences in robotic manipulation tasks?** Specifically, the authors aim to improve robot manipulation performance across different environments by introducing a new model, the Chunking Causal Transformer (CCT). This model supports multi-token prediction and adopts a novel attention interleaving strategy during training, which improves both computational efficiency and performance.

### Specific description of the problem

1. **Limitations of existing methods**:
   - The existing Decision Transformer (DT) and Trajectory Transformer (TT) are mainly applied to tasks where low-dimensional state variables are fully observable.
   - These methods struggle with high-dimensional observations (such as images or point clouds) and with unknown or unclear reward functions, both of which are very common in robotic manipulation tasks.
2. **Challenges to be addressed**:
   - **Efficient generation of action sequences**: Robot tasks usually require high-frequency control, so a method that can predict multiple future actions in a single inference pass is needed.
   - **Handling causal dependencies**: Robot tasks involve logical, spatial, and temporal causal dependencies, and a model that can effectively capture these dependencies is required.
   - **Adapting to different task requirements**: Different robot tasks may require different types of actions, so a flexible framework is needed to represent and generate these actions.

### Proposed solutions

1. **Chunking Causal Transformer (CCT)**:
   - It extends the next-token prediction of the causal transformer so that multiple tokens (i.e., a set of actions) can be predicted at once.
   - It introduces a new attention interleaving strategy that allows CCT to be trained efficiently with teacher-forcing.
2. **Autoregressive Policy (ARP)**:
   - It is built on CCT and learns to generate action sequences autoregressively.
   - ARP is evaluated in different types of robotic manipulation environments, including Push-T, ALOHA, and RLBench.
3. **Experimental verification**:
   - ARP was evaluated in multiple robotic manipulation environments. The results show that it outperforms the existing environment-specific state-of-the-art (SoTA) methods while being more efficient in computation and parameter count.

### Summary

The main contribution of this paper is a new autoregressive architecture (CCT and ARP) that addresses the problem of efficiently generating action sequences in robotic manipulation tasks. Its performance is verified experimentally on Push-T, ALOHA, and RLBench.
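To make the idea of predicting several tokens per pass more concrete, the following is a minimal sketch of chunked autoregressive generation. It is not the authors' implementation: `generate_in_chunks` and `toy_predictor` are hypothetical names introduced here for illustration, and the toy predictor simply counts upward in place of a learned transformer. The only point it illustrates is that a chunk size of K reduces the number of forward passes needed for N tokens from N to roughly N / K.

```python
from typing import Callable, List, Sequence, Tuple

# A predictor takes the tokens generated so far and a chunk length k,
# and returns the next k tokens. In a real policy this would be one
# forward pass of a learned model; here it is only a stand-in.
Predictor = Callable[[Sequence[int], int], List[int]]


def generate_in_chunks(
    predict_chunk: Predictor,
    prefix: Sequence[int],
    total_len: int,
    chunk_size: int,
) -> Tuple[List[int], int]:
    """Generate `total_len` tokens, up to `chunk_size` tokens per pass.

    Returns the generated tokens (without the prefix) and the number of
    predictor calls, i.e. the number of forward passes.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    sequence = list(prefix)
    generated = 0
    passes = 0
    while generated < total_len:
        # The last chunk may be shorter than chunk_size.
        k = min(chunk_size, total_len - generated)
        chunk = predict_chunk(sequence, k)
        if len(chunk) != k:
            raise ValueError("predictor returned a chunk of the wrong length")
        # Each new chunk is conditioned on everything generated so far.
        sequence.extend(chunk)
        generated += k
        passes += 1
    return sequence[len(prefix):], passes


def toy_predictor(context: Sequence[int], k: int) -> List[int]:
    """Hypothetical stand-in for a model: continue counting from the last token."""
    last = context[-1] if context else 0
    return [last + i + 1 for i in range(k)]


if __name__ == "__main__":
    single, single_passes = generate_in_chunks(toy_predictor, [0], 8, 1)
    chunked, chunked_passes = generate_in_chunks(toy_predictor, [0], 8, 4)
    print(single_passes, chunked_passes)  # 8 passes vs. 2 passes
    print(single == chunked)  # same tokens for this toy predictor
```

With a chunk size of 1 the loop reduces to ordinary next-token decoding. Larger chunks trade fewer passes against the fact that tokens inside a chunk are predicted jointly and not conditioned on one another step by step.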