Abstract: Deep imitation learning is promising for solving dexterous manipulation tasks because it does not require an environment model and pre-programmed robot behavior. However, its application to dual-arm manipulation tasks remains challenging. In a dual-arm manipulation setup, the increased number of state dimensions caused by the additional robot manipulators causes distractions and results in poor performance of the neural networks. We address this issue using a self-attention mechanism that computes dependencies between elements in a sequential input and focuses on important elements. A Transformer, a variant of self-attention architecture, is applied to deep imitation learning to solve dual-arm manipulation tasks in the real world. The proposed method has been tested on dual-arm manipulation tasks using a real robot. The experimental results demonstrated that the Transformer-based deep imitation learning architecture can attend to the important features among the sensory inputs, therefore reducing distractions and improving manipulation performance when compared with the baseline architecture without the self-attention mechanisms.
What problem does this paper attempt to address?
This paper addresses a key problem in applying deep imitation learning to dual-arm robot manipulation: the additional manipulator increases the number of state dimensions, and the extra inputs distract the neural network and degrade its performance.
### Problem Description
Deep imitation learning requires neither an environment model nor pre-programmed robot behavior, which makes it promising for dexterous manipulation. When it is applied to dual-arm manipulation, however, the second arm enlarges the state space. The input becomes more complex, the network is distracted by state elements that are irrelevant to the current action, and performance suffers.
### Solution
To address this problem, the authors apply a Transformer, a variant of the self-attention architecture, to deep imitation learning. Self-attention computes the dependencies between the elements of a sequential input and weights the important elements more heavily. With the Transformer, the model can attend to the important features among the sensory inputs, which reduces distractions and improves performance on dual-arm manipulation tasks.
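The self-attention step described above can be illustrated with a minimal NumPy sketch of single-head scaled dot-product attention. This is not the paper's implementation: the token layout (one token per sensory input), the dimensions, and the random projection matrices `w_q`, `w_k`, and `w_v` are assumptions chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    tokens: (n_tokens, d_model) array, one row per input element.
    Returns the attended features and the (n_tokens, n_tokens) weight matrix.
    """
    q = tokens @ w_q  # queries
    k = tokens @ w_k  # keys
    v = tokens @ w_v  # values
    # Pairwise dependency scores between every pair of input elements.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Each row sums to 1: how strongly one element attends to the others.
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Hypothetical input: 4 tokens standing in for sensory inputs such as
# left-arm state, right-arm state, and image features (assumed layout).
rng = np.random.default_rng(0)
n_tokens, d_model = 4, 8
tokens = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

output, weights = self_attention(tokens, w_q, w_k, w_v)
```

Each row of `weights` shows how much one input element draws on every other element. In a trained model, a small weight on an irrelevant element is what would correspond to a reduced distraction.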
### Main Contributions
1. **Introducing the Transformer Architecture**: The Transformer is applied to deep imitation learning to handle the high-dimensional state inputs of dual-arm robot manipulation tasks.
2. **Reducing Distractions**: Through the self-attention mechanism, the model attends to the important sensory inputs and is less distracted by irrelevant information.
3. **Experimental Verification**: Experiments on a real robot verify the effectiveness of the proposed Transformer-based deep imitation learning architecture on several kinds of dual-arm manipulation tasks, including uncoordinated tasks, goal-coordinated tasks, and bimanual cooperation tasks.
### Experimental Results
The experimental results show that the Transformer-based deep imitation learning method significantly outperforms the baseline model without self-attention on multiple tasks, and that it is more robust to high-dimensional state inputs. For example, in the BoxPush task, the Transformer-based method performs significantly better than the baseline on both position error and orientation error.
### Summary
By introducing the Transformer architecture, this paper reduces the distractions caused by the increased state dimensions in dual-arm robot manipulation and improves the performance and robustness of the model. The approach is not limited to dual-arm robots and could potentially be extended to more complex multi-arm or humanoid robot systems.