Abstract: Autoregressive models have demonstrated remarkable success in natural language processing. In this work, we design a simple yet effective autoregressive architecture for robotic manipulation tasks. We propose the Chunking Causal Transformer (CCT), which extends the next-single-token prediction of causal transformers to support multi-token prediction in a single pass. Further, we design a novel attention interleaving strategy that allows CCT to be trained efficiently with teacher-forcing. Based on CCT, we propose the Autoregressive Policy (ARP) model, which learns to generate action sequences autoregressively. We find that action sequence learning enables better leverage of the underlying causal relationships in robotic tasks. We evaluate ARP across diverse robotic manipulation environments, including Push-T, ALOHA, and RLBench, and show that it outperforms the state-of-the-art methods in all tested environments, while being more efficient in computation and parameter sizes. Video demonstrations, our source code, and the models of ARP can be found at http://github.com/mlzxy/arp.

What problem does this paper attempt to address?

The problem this paper addresses is: **How to design an effective autoregressive architecture for generating action sequences in robotic manipulation tasks?** Specifically, the authors aim to improve robot manipulation performance across different environments by introducing a new model, the Chunking Causal Transformer (CCT). The model supports multi-token prediction and adopts a novel attention interleaving strategy during training, which improves computational efficiency and performance.

### Specific description of the problem

1. **Limitations of existing methods**:
   - Decision Transformer (DT) and Trajectory Transformer (TT) are mainly applied to tasks where low-dimensional state variables are fully observable.
   - These methods struggle with high-dimensional observations (such as images or point clouds) and with unknown or unclear reward functions, both of which are very common in robotic manipulation tasks.
2. **Challenges to be addressed**:
   - **Efficient generation of action sequences**: Robot tasks usually require high-frequency control, so a method that can predict multiple future actions in a single inference pass is needed.
   - **Handling causal dependencies**: Robot tasks contain logical, spatial, and temporal causal dependencies, and a model that can effectively capture them is required.
   - **Adapting to different task requirements**: Different robot tasks may require different types of actions, so a flexible framework is needed to represent and generate these actions.

### Proposed solutions

1. **Chunking Causal Transformer (CCT)**:
   - It extends the next-token prediction of the causal transformer so that multiple tokens (i.e., a set of actions) can be predicted at once.
   - It introduces a new attention interleaving strategy that makes CCT more efficient to train with teacher-forcing.
2. **Autoregressive Policy (ARP)**:
   - It is built on CCT and learns to generate action sequences autoregressively.
   - ARP performs well across different types of robotic manipulation environments, including Push-T, ALOHA, and RLBench.
3. **Experimental verification**:
   - ARP was evaluated in multiple robotic manipulation environments, and the results show that it outperforms the environment-specific state-of-the-art (SoTA) methods while being more efficient in computation and parameter count.

### Summary

The main contribution of this paper is a new autoregressive architecture (CCT and ARP) that addresses the problem of efficiently generating action sequences in robotic manipulation tasks. Its performance is verified experimentally in the Push-T, ALOHA, and RLBench environments.
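
The idea of predicting several tokens per pass, as opposed to one token per pass, can be illustrated with a minimal Python sketch. This is not the authors' implementation: `generate_in_chunks`, `toy_predict_chunk`, and the parameter names are hypothetical, and the toy predictor merely continues a counting sequence in place of a trained CCT. The sketch only shows how a chunk size larger than one reduces the number of model calls needed to produce a sequence of a given length.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical signature: given the tokens so far and a chunk size k,
# return the next k tokens in one call (one "forward pass").
ChunkPredictor = Callable[[Sequence[int], int], List[int]]


def generate_in_chunks(
    predict_chunk: ChunkPredictor,
    prefix: Sequence[int],
    total_len: int,
    chunk_size: int,
) -> Tuple[List[int], int]:
    """Extend `prefix` autoregressively to `total_len` tokens.

    Each loop iteration makes one predictor call that emits up to
    `chunk_size` tokens, conditioned on everything generated so far.
    Returns the full sequence and the number of predictor calls.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    seq = list(prefix)
    passes = 0
    while len(seq) < total_len:
        k = min(chunk_size, total_len - len(seq))  # last chunk may be shorter
        new_tokens = predict_chunk(seq, k)
        if len(new_tokens) != k:
            raise ValueError("predictor returned the wrong number of tokens")
        seq.extend(new_tokens)
        passes += 1
    return seq, passes


def toy_predict_chunk(context: Sequence[int], k: int) -> List[int]:
    """Stand-in for a learned model (assumption): keeps counting upward."""
    last = context[-1] if context else -1
    return [last + 1 + i for i in range(k)]


if __name__ == "__main__":
    # chunk_size=1 corresponds to ordinary next-single-token prediction.
    single_seq, single_passes = generate_in_chunks(toy_predict_chunk, [0], 9, 1)
    chunk_seq, chunk_passes = generate_in_chunks(toy_predict_chunk, [0], 9, 4)
    print(single_seq == chunk_seq)      # True: same sequence either way
    print(single_passes, chunk_passes)  # 8 2
```

With the toy predictor, both settings produce the same nine-token sequence, but the chunked run needs 2 calls instead of 8. In a real policy the tokens would encode actions and the predictor would be a trained network, so the outputs of the two settings would generally differ.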

Autoregressive Action Sequence Learning for Robotic Manipulation

Learning Robot Manipulation Skills from Human Demonstration Videos Using Two-Stream 2-D/3-D Residual Networks with Self-Attention

Actra: Optimized Transformer Architecture for Vision-Language-Action Models in Robot Learning

Q-Attention: Enabling Efficient Learning for Vision-Based Robotic Manipulation

Bi-ACT: Bilateral Control-Based Imitation Learning via Action Chunking with Transformer

Scaling Manipulation Learning with Visual Kinematic Chain Prediction

Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations

RLAfford: End-to-End Affordance Learning for Robotic Manipulation

Precise and Dexterous Robotic Manipulation via Human-in-the-Loop Reinforcement Learning

Learning Robotic Manipulation through Visual Planning and Acting

Spatial-Language Attention Policies for Efficient Robot Learning

PRRM: An Efficient Framework for Learning Multi-step Robotic Manipulation Tasks

CAGE: Causal Attention Enables Data-Efficient Generalizable Robotic Manipulation

PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pre-Training

Efficient Robot Skill Learning with Imitation from a Single Video for Contact-Rich Fabric Manipulation

VQ-ACE: Efficient Policy Search for Dexterous Robotic Manipulation via Action Chunking Embedding

Adversarial Skill Chaining for Long-Horizon Robot Manipulation via Terminal State Regularization

Hierarchical Hybrid Learning for Long-Horizon Contact-Rich Robotic Assembly

InterACT: Inter-dependency Aware Action Chunking with Hierarchical Attention Transformers for Bimanual Manipulation

Learning Manipulation by Predicting Interaction

Safety Guaranteed Manipulation Based on Reinforcement Learning Planner and Model Predictive Control Actor