Abstract: Deep imitation learning is promising for solving dexterous manipulation tasks because it does not require an environment model and pre-programmed robot behavior. However, its application to dual-arm manipulation tasks remains challenging. In a dual-arm manipulation setup, the increased number of state dimensions caused by the additional robot manipulators causes distractions and results in poor performance of the neural networks. We address this issue using a self-attention mechanism that computes dependencies between elements in a sequential input and focuses on important elements. A Transformer, a variant of self-attention architecture, is applied to deep imitation learning to solve dual-arm manipulation tasks in the real world. The proposed method has been tested on dual-arm manipulation tasks using a real robot. The experimental results demonstrated that the Transformer-based deep imitation learning architecture can attend to the important features among the sensory inputs, therefore reducing distractions and improving manipulation performance when compared with the baseline architecture without the self-attention mechanisms.

What problem does this paper attempt to address?

This paper addresses a key problem in applying deep imitation learning to dual-arm robot manipulation: the additional robot arm increases the state dimension, which degrades the performance of the neural networks. Specifically, the extra state dimensions introduce distractions that interfere with the model.

### Problem Description

Deep imitation learning requires neither an environment model nor pre-programmed robot behaviors, which makes it promising for dexterous manipulation tasks. However, when it is applied to dual-arm manipulation, the additional robot arm increases the state dimension, the input becomes more complex, and the neural network becomes vulnerable to distraction, which hurts performance.

### Solution

To solve this problem, the authors apply a Transformer, an architecture built on self-attention, to deep imitation learning. Self-attention computes the dependencies between the elements of a sequential input, which suits it to high-dimensional inputs. With the Transformer, the model can focus on the important input features, reduce distraction, and thus improve performance on dual-arm manipulation tasks.

### Main Contributions

1. **Introducing the Transformer Architecture**: Applying a Transformer to deep imitation learning to handle the high-dimensional state inputs of dual-arm robot manipulation tasks.
2. **Reducing Interference**: Through the self-attention mechanism, the model can focus on the important sensory inputs and reduce the interference of irrelevant information.
3. **Experimental Verification**: Experiments on a real robot verify the effectiveness of the proposed Transformer-based deep imitation learning architecture on several kinds of dual-arm manipulation tasks, including non-coordinated tasks, target-coordinated tasks, and dual-arm cooperation tasks.

### Experimental Results

The experimental results show that the Transformer-based deep imitation learning method significantly outperforms the baseline model without self-attention on multiple tasks, and that it is more robust when dealing with high-dimensional state inputs. For example, in the BoxPush task, the Transformer-based method is significantly better than the baseline model in both position error and orientation error.

### Summary

By introducing the Transformer architecture, this paper reduces the distraction caused by the increased state dimension in dual-arm robot manipulation tasks and improves the performance and robustness of the model. The method is not limited to dual-arm robots and could in principle be extended to more complex multi-arm robots or humanoid robot systems.
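
To make the self-attention idea in the summary above more concrete, here is a minimal sketch of single-head scaled dot-product self-attention in NumPy. It is not the paper's architecture: the token layout (one token per arm state plus two image-feature tokens), the dimensions, and the random projection matrices are all assumptions chosen for illustration. It only shows how each input element is re-weighted according to its dependencies on the other elements.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    tokens: (n_tokens, d_model) array, one row per input element.
    w_q, w_k, w_v: (d_model, d_head) projection matrices.
    Returns the attended output (n_tokens, d_head) and the
    attention weights (n_tokens, n_tokens).
    """
    q = tokens @ w_q
    k = tokens @ w_k
    v = tokens @ w_v
    # Pairwise dependency scores between all tokens, scaled by sqrt(d_head).
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Each row is a distribution over the tokens that this token attends to.
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)
d_model, d_head = 8, 4

# Hypothetical sensory tokens (assumed layout, not taken from the paper):
# row 0 = left-arm state, row 1 = right-arm state, rows 2-3 = image features.
tokens = rng.normal(size=(4, d_model))

# Random projections stand in for learned parameters.
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

output, weights = self_attention(tokens, w_q, w_k, w_v)
print(weights.round(2))  # row i: how much token i attends to each token
```

In a trained model the projections are learned, so rows of `weights` would concentrate on the inputs that matter for the current action; with random matrices, as here, the weights carry no such meaning.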

Transformer-based deep imitation learning for dual-arm robot manipulation

Learning Robot Manipulation Skills from Human Demonstration Videos Using Two-Stream 2-D/3-D Residual Networks with Self-Attention

Goal-Conditioned Dual-Action Imitation Learning for Dexterous Dual-Arm Robot Manipulation

ILBiT: Imitation Learning for Robot Using Position and Torque Information based on Bilateral Control with Transformer

Memory-based gaze prediction in deep imitation learning for robot manipulation

Multi-task real-robot data with gaze attention for dual-arm fine manipulation

Bi-ACT: Bilateral Control-Based Imitation Learning via Action Chunking with Transformer

LfDT: Learning Dual-Arm Manipulation from Demonstration Translated from a Human and Robotic Arm

Leveraging Pretrained Latent Representations for Few-Shot Imitation Learning on a Dexterous Robotic Hand

Gaze-based dual resolution deep imitation learning for high-precision dexterous robot manipulation

InterACT: Inter-dependency Aware Action Chunking with Hierarchical Attention Transformers for Bimanual Manipulation

Vision-Based Efficient Robotic Manipulation with a Dual-Streaming Compact Convolutional Transformer

Transformers for One-Shot Visual Imitation

Training Robots without Robots: Deep Imitation Learning for Master-to-Robot Policy Transfer

From One Hand to Multiple Hands: Imitation Learning for Dexterous Manipulation From Single-Camera Teleoperation

Verification of Learning Model for Dual-arm Cooperative Motion in Imitation Learning based on Bilateral Control

A Task-Adaptive Deep Reinforcement Learning Framework for Dual-Arm Robot Manipulation

DA-VIL: Adaptive Dual-Arm Manipulation with Reinforcement Learning and Variable Impedance Control

Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation

Autonomous Dual-Arm Manipulation of Familiar Objects

A Dual-Arm Collaborative Framework for Dexterous Manipulation in Unstructured Environments with Contrastive Planning